
fix: improve Prometheus metrics naming conventions per best practices #2

Merged: SammyLin merged 1 commit into master from fix/prometheus-metrics-naming-conventions-v2 on Mar 6, 2026

@the3mi (Collaborator) commented on Mar 6, 2026

Summary

This PR renames or removes several metrics in collector/cron.go, collector/agent.go, and collector/workspace.go so that they follow Prometheus naming best practices.


Changes Made

A) collector/cron.go — Cron Job Metrics

❌ Removed: openclaw_cron_job_last_run_age_seconds

Reason: Prometheus best practices recommend exposing Unix timestamps (which are stable and can be used to compute age at query time) rather than pre-computed age values that become stale between scrapes. The existing openclaw_cron_job_last_run_at_seconds already provides the Unix timestamp, making the age metric redundant.

Migration: Use time() - openclaw_cron_job_last_run_at_seconds in PromQL.

❌ Removed: openclaw_cron_job_next_run_in_seconds

Reason: Same rationale — pre-computed countdown values are redundant when openclaw_cron_job_next_run_at_seconds (Unix timestamp) is already available.

Migration: Use openclaw_cron_job_next_run_at_seconds - time() in PromQL.

🔄 Renamed: openclaw_cron_job_last_duration_ms → openclaw_cron_job_last_duration_seconds

Reason: Prometheus convention is to use base units (seconds for time, bytes for data). Millisecond suffixes are non-standard and require consumers to mentally convert. The value is now divided by 1000 to convert from ms to seconds.
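A minimal sketch of the unit conversion, assuming the collector has the duration as an integer millisecond count (durationSecondsFromMillis is a hypothetical name). Going through time.Duration is equivalent, up to floating-point rounding, to the plain division by 1000 the PR describes, but keeps the unit explicit in the types.

```go
package main

import (
	"fmt"
	"time"
)

// durationSecondsFromMillis converts a millisecond count (the unit the old
// *_last_duration_ms metric exposed) into base-unit seconds.
func durationSecondsFromMillis(ms int64) float64 {
	return (time.Duration(ms) * time.Millisecond).Seconds()
}

func main() {
	fmt.Println(durationSecondsFromMillis(1500)) // 1.5
	fmt.Println(durationSecondsFromMillis(250))  // 0.25
}
```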


B) collector/agent.go — Agent Metrics

🔄 Renamed: openclaw_agent_last_activity_seconds → openclaw_agent_last_activity_timestamp_seconds

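Reason: The metric was previously an age (seconds since last activity), a value that keeps changing between scrapes even when nothing happens. It now exports a Unix timestamp instead (computed as float64(time.Now().Unix()) - secondsAgo). The _timestamp_seconds suffix is a common Prometheus convention for Unix-timestamp gauges (for example, Prometheus's own prometheus_config_last_reload_success_timestamp_seconds); process_start_time_seconds is another timestamp gauge, although it uses a plain _time_seconds suffix.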

Migration: Use time() - openclaw_agent_last_activity_timestamp_seconds in PromQL to get seconds since last activity.


C) collector/workspace.go — Workspace Metrics

🔄 Renamed: openclaw_md_workspace_total_bytes → openclaw_md_workspace_bytes

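Reason: The total_ prefix is redundant when the metric already represents an aggregate value per workspace label. Prometheus naming guidelines discourage redundant words in metric names, and "total" in a gauge name is easily confused with the _total suffix that Prometheus reserves for counters.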

🔄 Renamed: openclaw_md_workspace_total_tokens_estimated → openclaw_md_workspace_tokens_estimated

Reason: Same rationale as above — total_ prefix dropped.
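For reference, this is roughly how the renamed workspace gauges look in a scrape. The sketch renders sample lines in the Prometheus text exposition format, escaping label values as the format requires; the label name workspace and the values are assumptions, and HELP/TYPE lines are omitted. In the exporter itself the Prometheus Go client library handles this encoding, so gaugeLine is purely illustrative.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// labelEscaper applies the text exposition format's escaping rules for label
// values: backslash, double quote and line feed.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// gaugeLine renders a single sample line with one label, e.g.
// name{label="value"} 42. HELP and TYPE lines are not produced.
func gaugeLine(name, label, value string, v float64) string {
	return fmt.Sprintf("%s{%s=\"%s\"} %s",
		name, label, labelEscaper.Replace(value), strconv.FormatFloat(v, 'g', -1, 64))
}

func main() {
	// Hypothetical workspace "main" with made-up values.
	fmt.Println(gaugeLine("openclaw_md_workspace_bytes", "workspace", "main", 20480))
	fmt.Println(gaugeLine("openclaw_md_workspace_tokens_estimated", "workspace", "main", 5120))
}
```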


Files Updated

  • collector/cron.go — Removed 2 redundant metrics, renamed duration metric, removed unused time import
  • collector/agent.go — Renamed metric, changed value to Unix timestamp
  • collector/workspace.go — Renamed 2 workspace aggregate metrics
  • README.md — Updated metric table
  • README.zh-TW.md — Updated metric table (Traditional Chinese)
  • deploy/grafana/dashboards/openclaw-complete.json — Updated agent activity query to time() - openclaw_agent_last_activity_timestamp_seconds
  • deploy/grafana/dashboards/token-usage.json — Updated workspace tokens query to new metric name

⚠️ Breaking Changes

| Old metric | Status | Replacement / migration |
|---|---|---|
| openclaw_cron_job_last_run_age_seconds | Removed | time() - openclaw_cron_job_last_run_at_seconds |
| openclaw_cron_job_next_run_in_seconds | Removed | openclaw_cron_job_next_run_at_seconds - time() |
| openclaw_cron_job_last_duration_ms | Renamed | openclaw_cron_job_last_duration_seconds (value ÷ 1000) |
| openclaw_agent_last_activity_seconds | Renamed; value changed from age to Unix timestamp | time() - openclaw_agent_last_activity_timestamp_seconds |
| openclaw_md_workspace_total_bytes | Renamed | openclaw_md_workspace_bytes |
| openclaw_md_workspace_total_tokens_estimated | Renamed | openclaw_md_workspace_tokens_estimated |

PromQL Migration Examples

```
# Seconds since a cron job last ran (was: openclaw_cron_job_last_run_age_seconds)
time() - openclaw_cron_job_last_run_at_seconds

# Seconds until next cron run (was: openclaw_cron_job_next_run_in_seconds)
openclaw_cron_job_next_run_at_seconds - time()

# Last cron job duration in seconds (was: openclaw_cron_job_last_duration_ms / 1000)
openclaw_cron_job_last_duration_seconds

# Seconds since agent last active (was: openclaw_agent_last_activity_seconds)
time() - openclaw_agent_last_activity_timestamp_seconds

# Total workspace bytes (was: openclaw_md_workspace_total_bytes)
openclaw_md_workspace_bytes

# Total workspace estimated tokens (was: openclaw_md_workspace_total_tokens_estimated)
openclaw_md_workspace_tokens_estimated
```

@SammyLin merged commit e82ea9d into master on Mar 6, 2026 (3 checks passed).
@SammyLin deleted the fix/prometheus-metrics-naming-conventions-v2 branch on March 6, 2026 at 07:04.
Review comment thread on collector/agent.go:

```diff
 ch <- prometheus.MustNewConstMetric(agentSessionsDesc, prometheus.GaugeValue, float64(sessions), name)
 ch <- prometheus.MustNewConstMetric(agentStateDesc, prometheus.GaugeValue, StateMap[state], name)
-ch <- prometheus.MustNewConstMetric(agentLastActivityDesc, prometheus.GaugeValue, secondsAgo, name)
+ch <- prometheus.MustNewConstMetric(agentLastActivityTimestampDesc, prometheus.GaugeValue, float64(time.Now().Unix())-secondsAgo, name)
```
In getAgentState you first do secondsAgo := time.Since(latest.modTime).Seconds(), and then you convert it back again. Maybe even cleaner to just return the timestamp to begin with?

Owner reply:

Good point! Fixed in v0.4.1 — now returns the timestamp directly instead of the roundtrip conversion. https://github.com/SammyLin/openclaw-exporter/releases/tag/v0.4.1
