feat: Measure system errors using a counter by morgan-wowk · Pull Request #127 · TangleML/tangle

morgan-wowk · 2026-02-25T13:01:31Z

TL;DR

Added OpenTelemetry metrics instrumentation to track execution system errors in the orchestrator.

Business value

We will be able to track the rate of system errors and respond to high or increasing rates.

Future iterations

In the future we will emit metrics for state transitions in general and use measurement attributes to create dimensions on status. At that point we will have the option to deprecate this measurement, which is specific to system errors.
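To illustrate why a general state-transition metric could make the dedicated system-error counter redundant, here is a minimal sketch. It uses a small in-memory stand-in rather than the OpenTelemetry SDK. The class, the instrument name execution.state_transitions, the attribute key execution.status, and every status value other than SYSTEM_ERROR are assumptions made for illustration, not names taken from this PR.

```python
from collections import Counter


class InMemoryCounter:
    """Stand-in for a metrics counter; add() takes an amount plus attributes."""

    def __init__(self, name):
        self.name = name
        self._totals = Counter()

    def add(self, amount, attributes=None):
        if amount < 0:
            raise ValueError("counters are monotonic; amount must be non-negative")
        # Each distinct attribute set becomes its own series (a "dimension").
        key = tuple(sorted((attributes or {}).items()))
        self._totals[key] += amount

    def total(self, attributes=None):
        """Sum every series whose attributes include the given key/value pairs."""
        wanted = set((attributes or {}).items())
        return sum(v for k, v in self._totals.items() if wanted <= set(k))


# Hypothetical instrument name; one counter covers all state transitions.
state_transitions = InMemoryCounter("execution.state_transitions")

# RUNNING and SUCCEEDED are assumed status names used only for this example.
for status in ["RUNNING", "SUCCEEDED", "RUNNING", "SYSTEM_ERROR", "SYSTEM_ERROR"]:
    state_transitions.add(1, {"execution.status": status})

# Filtering on the status dimension recovers the system-error count.
system_errors = state_transitions.total({"execution.status": "SYSTEM_ERROR"})
all_transitions = state_transitions.total()
```

Filtering on the status attribute yields the same number a dedicated system-error counter would report, which is why the dedicated measurement could later be deprecated.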

What changed?

Created a new metrics module (cloud_pipelines_backend/instrumentation/metrics.py) that defines an orchestrator meter and an execution_system_errors counter instrument
Integrated the metrics counter into the record_system_error_exception function in orchestrator_sql.py to increment the counter when system errors occur
Added comprehensive OpenTelemetry strategy documentation (otel_strategy.md) covering best practices for meters, instruments, temporality, and aggregation
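For readers unfamiliar with the OpenTelemetry metrics API, the first two changes above can be sketched roughly as follows. This is not the PR's actual code: the meter name, unit, and description are assumptions, and the no-op fallback exists only so the sketch runs where opentelemetry-api is not installed. The metric name execution.system_errors is taken from the testing instructions in this description.

```python
"""Sketch of an instrumentation/metrics.py-style module (details assumed)."""

try:
    from opentelemetry import metrics
except ImportError:  # opentelemetry-api is not installed
    metrics = None


class _NoOpCounter:
    """Hypothetical fallback so callers can always call .add()."""

    def add(self, amount, attributes=None):
        pass


def _create_system_error_counter():
    if metrics is None:
        return _NoOpCounter()
    # Assumed meter name; the PR only states that an orchestrator meter exists.
    orchestrator_meter = metrics.get_meter("tangle.orchestrator")
    return orchestrator_meter.create_counter(
        name="execution.system_errors",
        unit="{error}",  # assumed unit annotation
        description="Execution nodes that ended in SYSTEM_ERROR",  # assumed text
    )


execution_system_errors = _create_system_error_counter()


def record_system_error_exception(execution, exception):
    """Simplified stand-in for the orchestrator function touched by this PR."""
    execution_system_errors.add(1)
```

Without a configured meter provider, the OpenTelemetry API hands back no-op instruments, so calling add(1) is safe even when no exporter is set up.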

How to test?

Cherry pick the OTel stack from https://app.graphite.com/github/pr/TangleML/tangle/101/feat-Add-development-observability-stack-for-testing-OpenTelemetry-with-Jaeger%2C-and-Prometheus
Run the stack (docker-compose up -d)
Run export TANGLE_OTEL_TRACE_EXPORTER_ENDPOINT="http://localhost:4317" && export TANGLE_OTEL_TRACE_EXPORTER_PROTOCOL="grpc" && export TANGLE_OTEL_METRIC_EXPORTER_ENDPOINT="http://localhost:4317" && export TANGLE_OTEL_METRIC_EXPORTER_PROTOCOL="grpc"
Trigger execution nodes that result in SYSTEM_ERROR status and verify that the execution.system_errors metric is incremented in your OpenTelemetry metrics backend. To force a system error, add the line raise RuntimeError("Temporary") at the start of https://github.com/TangleML/tangle/blob/system-error-metric/cloud_pipelines_backend/orchestrator_sql.py#L209

Why make this change?

This enables monitoring and alerting on system errors in pipeline executions, providing better observability into the health and reliability of the orchestrator component. The metrics follow OpenTelemetry semantic conventions and provide a foundation for expanding observability coverage across the application.


cloud_pipelines_backend/instrumentation/metrics.py

Ark-kun · 2026-03-05T05:43:31Z

cloud_pipelines_backend/orchestrator_sql.py



 def record_system_error_exception(execution: bts.ExecutionNode, exception: Exception):
+    app_metrics.execution_system_errors.add(1)


Should we increment system error counter or would it be better to record a system error event?

In terms of metrics (counter, histogram, gauge), a counter is the most applicable here. However, since this is an error, in the future we could also report it to Observe (they mimic Bugsnag) with extra details on the error.

morgan-wowk · 2026-03-05T21:25:50Z

Merge activity

Mar 5, 9:25 PM UTC: A user started a stack merge that includes this pull request via Graphite.
Mar 5, 9:27 PM UTC: Graphite couldn't merge this pull request because a downstack PR feat: Set service version on OTel data #124 failed to merge.
Mar 5, 9:43 PM UTC: A user started a stack merge that includes this pull request via Graphite.
Mar 5, 9:48 PM UTC: Graphite rebased this pull request as part of a merge.
Mar 5, 9:49 PM UTC: @morgan-wowk merged this pull request with Graphite.

Made-with: Cursor

This was referenced Feb 25, 2026

chore: Refactor OTel concerns in preparation for metric provider #123

Merged

feat: Set service version on OTel data #124

Merged

chore!: Refactor OTel exporters to be signal-specific #125

Merged

morgan-wowk mentioned this pull request Feb 25, 2026

feat: Setup metric provider for manual and auto-instrumented measurements #126

Merged

morgan-wowk self-assigned this Feb 25, 2026

morgan-wowk force-pushed the system-error-metric branch from aa225ff to 7cbcc18 Compare February 25, 2026 13:09

morgan-wowk marked this pull request as ready for review February 25, 2026 19:54

morgan-wowk requested a review from Ark-kun as a code owner February 25, 2026 19:54

morgan-wowk force-pushed the system-error-metric branch from 7cbcc18 to f0bc197 Compare February 26, 2026 00:56

morgan-wowk force-pushed the setup-metric-provider branch from 66474d1 to 06ecffc Compare February 26, 2026 00:56

yuechao-qin requested changes Feb 26, 2026

cloud_pipelines_backend/instrumentation/metrics.py (resolved)

morgan-wowk requested a review from yuechao-qin February 26, 2026 02:09

yuechao-qin approved these changes Feb 26, 2026

cloud_pipelines_backend/instrumentation/metrics.py (resolved)

morgan-wowk added the observability label Mar 2, 2026 — with Graphite App

morgan-wowk force-pushed the setup-metric-provider branch from 06ecffc to fdeed83 Compare March 5, 2026 00:24

morgan-wowk force-pushed the system-error-metric branch 2 times, most recently from 33e55b3 to 64eb3fb Compare March 5, 2026 00:40

morgan-wowk force-pushed the setup-metric-provider branch from fdeed83 to 57bd9cf Compare March 5, 2026 00:40

Ark-kun reviewed Mar 5, 2026

Ark-kun approved these changes Mar 5, 2026

morgan-wowk changed the base branch from setup-metric-provider to graphite-base/127 March 5, 2026 21:46

morgan-wowk changed the base branch from graphite-base/127 to master March 5, 2026 21:47

feat: Measure system errors using a counter

3613000

Made-with: Cursor

morgan-wowk force-pushed the system-error-metric branch from 64eb3fb to 3613000 Compare March 5, 2026 21:48

morgan-wowk merged commit 4ada5bb into master Mar 5, 2026
5 checks passed

morgan-wowk merged 1 commit into master from system-error-metric