feat: Measure system errors using a counter #127
Merged
morgan-wowk merged 1 commit into master on Mar 5, 2026
yuechao-qin requested changes on Feb 26, 2026
yuechao-qin approved these changes on Feb 26, 2026
Ark-kun reviewed on Mar 5, 2026
    def record_system_error_exception(execution: bts.ExecutionNode, exception: Exception):
        app_metrics.execution_system_errors.add(1)
Contributor
Should we increment system error counter or would it be better to record a system error event?
Collaborator
Author
In terms of metrics (counter, histogram, gauge), a counter is the most applicable here. However, since this is an error, we could also report it to Observe (which mimics Bugsnag) in the future, with extra details on the error.
Ark-kun approved these changes on Mar 5, 2026

TL;DR
Added OpenTelemetry metrics instrumentation to track execution system errors in the orchestrator.
Business value
We will be able to track the rate of system errors and respond to high or increasing rates.
Future iterations
In the future we will emit metrics for state transitions in general and use measurement attributes to create dimensions on status. At that point we will have the option to deprecate this measurement, which is specific to system errors.
What changed?
- Metrics module (cloud_pipelines_backend/instrumentation/metrics.py) that defines an orchestrator meter and an execution_system_errors counter instrument
- record_system_error_exception function in orchestrator_sql.py to increment the counter when system errors occur
- Documentation (otel_strategy.md) covering best practices for meters, instruments, temporality, and aggregation
How to test?
1. Start the local services (docker-compose up -d)
2. Set the exporter environment variables:
    export TANGLE_OTEL_TRACE_EXPORTER_ENDPOINT="http://localhost:4317"
    export TANGLE_OTEL_TRACE_EXPORTER_PROTOCOL="grpc"
    export TANGLE_OTEL_METRIC_EXPORTER_ENDPOINT="http://localhost:4317"
    export TANGLE_OTEL_METRIC_EXPORTER_PROTOCOL="grpc"
3. Verify that the execution.system_errors metric is incremented in your OpenTelemetry metrics backend. To trigger a system error, add a line raise RuntimeError("Temporary") to the start of https://github.com/TangleML/tangle/blob/system-error-metric/cloud_pipelines_backend/orchestrator_sql.py#L209
Why make this change?
This enables monitoring and alerting on system errors in pipeline executions, providing better observability into the health and reliability of the orchestrator component. The metrics follow OpenTelemetry semantic conventions and provide a foundation for expanding observability coverage across the application.
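
As an illustration of how the exporter variables from "How to test?" might be consumed, here is a small standard-library sketch. The `ExporterSettings` type, the `read_exporter_settings` helper, the default protocol, and the set of accepted protocols are assumptions made for this example. Only the variable names and the `grpc` value appear in the PR description, and the application's actual configuration code may look quite different.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ExporterSettings:
    """Hypothetical container for one signal's exporter configuration."""

    endpoint: str
    protocol: str


def read_exporter_settings(
    signal: str, env: Mapping[str, str] = os.environ
) -> Optional[ExporterSettings]:
    """Read TANGLE_OTEL_<SIGNAL>_EXPORTER_* variables for "trace" or "metric".

    Returns None when no endpoint is set, meaning the exporter stays disabled.
    """
    prefix = f"TANGLE_OTEL_{signal.upper()}_EXPORTER"
    endpoint = env.get(f"{prefix}_ENDPOINT")
    if not endpoint:
        return None
    # Assumption: default to gRPC and accept only these two protocol names.
    protocol = env.get(f"{prefix}_PROTOCOL", "grpc")
    if protocol not in ("grpc", "http"):
        raise ValueError(f"Unsupported protocol for {signal}: {protocol!r}")
    return ExporterSettings(endpoint=endpoint, protocol=protocol)


example_env = {
    "TANGLE_OTEL_METRIC_EXPORTER_ENDPOINT": "http://localhost:4317",
    "TANGLE_OTEL_METRIC_EXPORTER_PROTOCOL": "grpc",
}
metric_settings = read_exporter_settings("metric", example_env)
trace_settings = read_exporter_settings("trace", example_env)
```

Passing the environment in as a mapping keeps the helper easy to exercise in tests without touching the real process environment.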