feat: Measure system errors using a counter #127
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| """ | ||
| Application-level meters and instruments. | ||
|
|
||
| Meters should be named after the software component they represent. | ||
| They should not change over time (avoid using __name__). | ||
|
|
||
| Instruments should be named after the metric they represent. | ||
| First and foremost, they should follow the semantic conventions | ||
| (https://opentelemetry.io/docs/specs/semconv/general/metrics/) | ||
| of OTel if the metric is common (e.g. http.server.duration). | ||
|
|
||
| For custom, application-specific measurements, choose a name after | ||
| what is being measured, and not after the software component that | ||
| measures it. | ||
|
|
||
| Good example: | ||
| - Meter: tangle.orchestrator | ||
| - Instrument: execution.system_errors | ||
|
|
||
| Bad example: | ||
| - Meter: tangle.orchestrator | ||
| - Instrument: orchestrator_execution_system_errors | ||
| """ | ||
|
|
||
| from opentelemetry import metrics as otel_metrics | ||
|
|
||
| # --------------------------------------------------------------------------- | ||
| # tangle.orchestrator | ||
| # --------------------------------------------------------------------------- | ||
| orchestrator_meter = otel_metrics.get_meter("tangle.orchestrator") | ||
|
|
||
| execution_system_errors = orchestrator_meter.create_counter( | ||
| name="execution.system_errors", | ||
| description="Number of execution nodes that ended in SYSTEM_ERROR status", | ||
| unit="{error}", | ||
| ) | ||
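
The naming guidance in the docstring above could be checked mechanically. This is a minimal sketch, not part of this PR or of OpenTelemetry: `check_instrument_name` is a hypothetical helper, and the regular expression assumes the instrument-name syntax from the OpenTelemetry metrics API specification (a leading letter, then letters, digits, `_`, `.`, `-`, `/`, at most 255 characters).

```python
from __future__ import annotations

import re

# Assumed instrument-name syntax (see lead-in); adjust if the spec differs.
_INSTRUMENT_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_.\-/]{0,254}")


def check_instrument_name(meter_name: str, instrument_name: str) -> list[str]:
    """Return a list of problems; an empty list means the name looks fine."""
    problems = []
    if not _INSTRUMENT_NAME.fullmatch(instrument_name):
        problems.append("name does not match the assumed instrument-name syntax")
    # The component is the last dotted segment of the meter name,
    # e.g. "orchestrator" for "tangle.orchestrator".
    component = meter_name.rsplit(".", 1)[-1].lower()
    tokens = re.split(r"[._\-/]", instrument_name.lower())
    if component in tokens:
        problems.append(f"name repeats the component {component!r} from the meter")
    return problems
```

Applied to the two examples in the docstring, `check_instrument_name("tangle.orchestrator", "execution.system_errors")` reports no problems, while `"orchestrator_execution_system_errors"` is flagged for repeating the component name.
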
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -21,6 +21,7 @@ | |
| from .launchers import common_annotations | ||
| from .launchers import interfaces as launcher_interfaces | ||
| from .instrumentation import contextual_logging | ||
| from .instrumentation import metrics as app_metrics | ||
|
|
||
| _logger = logging.getLogger(__name__) | ||
|
|
||
|
|
@@ -1037,6 +1038,8 @@ def _retry( | |
|
|
||
|
|
||
| def record_system_error_exception(execution: bts.ExecutionNode, exception: Exception): | ||
| app_metrics.execution_system_errors.add(1) | ||
|
Contributor
Should we increment a system error counter, or would it be better to record a system error event?
Collaborator, Author
In terms of metrics (counter, histogram, gauge), a counter is the most applicable here. However, since this is an error, we could also report it to Observe (which mimics Bugsnag) in the future, with extra details on the error.
||
|
|
||
| if execution.extra_data is None: | ||
| execution.extra_data = {} | ||
| execution.extra_data[ | ||
|
|
||
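
The review thread above weighs a plain counter against a richer error report. One middle ground, sketched here under stated assumptions, is to keep the counter but attach a low-cardinality attribute to each increment. The attribute key `error.type` follows the OpenTelemetry semantic conventions; using the exception class name as its value, and the names `system_error_attributes`, `record_system_error`, and `InMemoryCounter`, are hypothetical and not part of this PR. The stand-in counter exists only so the sketch runs without an OpenTelemetry SDK; an OTel `Counter` accepts the same `add(amount, attributes=...)` call.

```python
from __future__ import annotations

from typing import Mapping, Protocol


class CounterLike(Protocol):
    """Anything with an OTel-style add(amount, attributes=...) method."""

    def add(self, amount: int, attributes: Mapping[str, str] | None = None) -> None: ...


def system_error_attributes(exception: Exception) -> dict[str, str]:
    # Assumption: the exception class name is a good enough error type.
    return {"error.type": type(exception).__qualname__}


def record_system_error(counter: CounterLike, exception: Exception) -> None:
    # One increment per failed execution node, tagged with the error type.
    counter.add(1, attributes=system_error_attributes(exception))


class InMemoryCounter:
    """Hypothetical stand-in that sums increments per attribute set."""

    def __init__(self) -> None:
        self.totals = {}

    def add(self, amount, attributes=None):
        key = tuple(sorted((attributes or {}).items()))
        self.totals[key] = self.totals.get(key, 0) + amount
```

Full error details (message, stack trace) would still belong in an error-reporting tool or in logs rather than in metric attributes, since every distinct attribute value creates a separate time series.
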