feat: Measure system errors using a counter #127

Merged
morgan-wowk merged 1 commit into master from system-error-metric
Mar 5, 2026

@morgan-wowk morgan-wowk commented Feb 25, 2026

TL;DR

Added OpenTelemetry metrics instrumentation to track execution system errors in the orchestrator.

Screenshot 2026-02-25 at 4.39.17 AM.png

Screenshot 2026-02-25 at 4.41.10 AM.png

Business value

We will be able to track the rate of system errors and respond to high or increasing rates.

Future iterations

In the future, we will emit metrics for state transitions in general and use measurement attributes to create dimensions on status. At that point, we will have the option to deprecate this system-error-specific measurement.
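As a rough illustration of that direction, the sketch below records every state transition on one counter and uses a `status` attribute as the dimension. It is not code from this PR: `record_state_transition`, `InMemoryCounter`, and the status values are hypothetical, and the in-memory counter only stands in for an OpenTelemetry counter, whose `add(amount, attributes)` call has the same shape.

```python
from collections import Counter as TallyCounter
from typing import Mapping, Optional, Protocol


class SupportsAdd(Protocol):
    """Shape of the call we rely on; OpenTelemetry counters expose add(amount, attributes)."""

    def add(self, amount: int, attributes: Optional[Mapping[str, str]] = None) -> None: ...


class InMemoryCounter:
    """Hypothetical stand-in that tallies measurements per attribute set."""

    def __init__(self) -> None:
        self.totals: TallyCounter = TallyCounter()

    def add(self, amount: int, attributes: Optional[Mapping[str, str]] = None) -> None:
        # Each distinct attribute set becomes its own series (one dimension value).
        key = tuple(sorted((attributes or {}).items()))
        self.totals[key] += amount


def record_state_transition(counter: SupportsAdd, status: str) -> None:
    # One instrument for all transitions; the status is carried as an attribute.
    counter.add(1, {"status": status})


transitions = InMemoryCounter()
for status in ["SUCCEEDED", "SYSTEM_ERROR", "SUCCEEDED"]:  # assumed status names
    record_state_transition(transitions, status)

# System errors become a filter on the general metric rather than a separate instrument.
system_errors = transitions.totals[(("status", "SYSTEM_ERROR"),)]
```

With this shape, the dedicated system-error counter carries no information that the `status` dimension does not already provide, which is what would make deprecating it an option.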

What changed?

  • Created a new metrics module (cloud_pipelines_backend/instrumentation/metrics.py) that defines an orchestrator meter and an execution_system_errors counter instrument
  • Integrated the metrics counter into the record_system_error_exception function in orchestrator_sql.py to increment the counter when system errors occur
  • Added comprehensive OpenTelemetry strategy documentation (otel_strategy.md) covering best practices for meters, instruments, temporality, and aggregation
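For readers without the diff open, a metrics module of the kind described above might look roughly like the following. This is a minimal sketch, not the contents of `metrics.py`: the meter name, instrument name, unit, and description are assumptions, and the no-op fallback exists only so the snippet runs where `opentelemetry-api` is not installed.

```python
"""Sketch of an instrumentation/metrics.py-style module (all names are assumptions)."""

try:
    from opentelemetry import metrics
except ImportError:  # opentelemetry-api is not installed
    metrics = None


class _NoOpCounter:
    """Fallback used only when opentelemetry-api is unavailable."""

    def add(self, amount, attributes=None):
        pass


def _create_system_error_counter():
    if metrics is None:
        return _NoOpCounter()
    # One meter for the orchestrator component; the name is an assumption.
    meter = metrics.get_meter("orchestrator")
    return meter.create_counter(
        name="execution.system_errors",  # assumed instrument name
        unit="{error}",
        description="Number of executions that ended in a system error",
    )


# Module-level instrument, created once and imported by callers.
execution_system_errors = _create_system_error_counter()

# Call site, mirroring the one-line change in record_system_error_exception:
execution_system_errors.add(1)
```

Without a configured meter provider the OpenTelemetry API treats `add` as a no-op, so importing and calling the counter is safe even when no exporter is set up.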

How to test?

Why make this change?

This enables monitoring and alerting on system errors in pipeline executions, providing better observability into the health and reliability of the orchestrator component. The metrics follow OpenTelemetry semantic conventions and provide a foundation for expanding observability coverage across the application.

morgan-wowk commented Feb 25, 2026

@morgan-wowk morgan-wowk force-pushed the setup-metric-provider branch from 06ecffc to fdeed83 Compare March 5, 2026 00:24
@morgan-wowk morgan-wowk force-pushed the system-error-metric branch 2 times, most recently from 33e55b3 to 64eb3fb Compare March 5, 2026 00:40
@morgan-wowk morgan-wowk force-pushed the setup-metric-provider branch from fdeed83 to 57bd9cf Compare March 5, 2026 00:40


def record_system_error_exception(execution: bts.ExecutionNode, exception: Exception):
    app_metrics.execution_system_errors.add(1)
Contributor

Should we increment system error counter or would it be better to record a system error event?

Collaborator Author

Among the metric instrument types (counter, histogram, gauge), a counter is the most applicable here. However, since this is an error, we could also report it to Observe in the future (they offer Bugsnag-like error reporting), with extra details on the error.
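To make the distinction concrete, the sketch below does both: it increments a counter for the aggregate rate and emits a per-occurrence record with details. It is only an illustration of the trade-off discussed here. The standard-library logger stands in for an error-reporting service, and `_CountingStub`, `record_system_error`, and the `execution_id` parameter are hypothetical names, not the PR's code.

```python
import logging

logger = logging.getLogger("orchestrator")


class _CountingStub:
    """Hypothetical stand-in for the execution_system_errors counter."""

    def __init__(self) -> None:
        self.total = 0

    def add(self, amount, attributes=None) -> None:
        self.total += amount


execution_system_errors = _CountingStub()


def record_system_error(execution_id: str, exception: Exception) -> None:
    # Metric: a cheap aggregate that answers "how often", suitable for rate alerts.
    execution_system_errors.add(1)
    # Error report stand-in: one record per occurrence, with the traceback attached.
    logger.error("System error in execution %s", execution_id, exc_info=exception)


record_system_error("exec-1", RuntimeError("boom"))
```

The counter keeps cardinality low, while the per-occurrence record is where details such as the execution ID and stack trace belong.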

Collaborator Author

morgan-wowk commented Mar 5, 2026

Merge activity

  • Mar 5, 9:25 PM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Mar 5, 9:27 PM UTC: Graphite couldn't merge this pull request because a downstack PR feat: Set service version on OTel data #124 failed to merge.
  • Mar 5, 9:43 PM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Mar 5, 9:48 PM UTC: Graphite rebased this pull request as part of a merge.
  • Mar 5, 9:49 PM UTC: @morgan-wowk merged this pull request with Graphite.

@morgan-wowk morgan-wowk changed the base branch from setup-metric-provider to graphite-base/127 March 5, 2026 21:46
@morgan-wowk morgan-wowk changed the base branch from graphite-base/127 to master March 5, 2026 21:47
@morgan-wowk morgan-wowk force-pushed the system-error-metric branch from 64eb3fb to 3613000 Compare March 5, 2026 21:48
@morgan-wowk morgan-wowk merged commit 4ada5bb into master Mar 5, 2026
5 checks passed
3 participants