feat(monitoring): add prometheus metrics collection and grafana dashboard #100

KalpanaBhaskar wants to merge 2 commits into fuzziecoder:main from
Conversation
@KalpanaBhaskar is attempting to deploy a commit to the Revon Team on Vercel. A member of the Team first needs to authorize it.
Important: Review skipped due to path filters.

⛔ Files ignored due to path filters (8)

CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters.
📝 Walkthrough

This PR introduces Prometheus monitoring to the pipeline execution system. It adds metric definitions for execution tracking, instruments the executor with timing measurements, and exposes metrics via a new `/metrics` endpoint.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant Executor
    participant Metrics
    participant Prometheus
    participant Grafana
    Client->>Executor: Start Pipeline Execution
    Executor->>Metrics: Increment pipeline_active_executions
    Executor->>Metrics: Record pipeline start_time
    rect rgba(100, 150, 200, 0.5)
        Note over Executor,Metrics: Pipeline Processing
        Executor->>Executor: Execute Stages
        Executor->>Metrics: Record stage start_time
        Executor->>Metrics: Observe stage_execution_duration_seconds
    end
    alt Pipeline Success
        Executor->>Metrics: Increment pipeline_executions_total (status=success)
        Executor->>Metrics: Observe pipeline_execution_duration_seconds
        Executor->>Metrics: Decrement pipeline_active_executions
    else Pipeline Failure
        Executor->>Metrics: Increment pipeline_executions_total (status=failed)
        Executor->>Metrics: Increment pipeline_failures_total
        Executor->>Metrics: Observe pipeline_execution_duration_seconds
        Executor->>Metrics: Decrement pipeline_active_executions
    end
    Client->>Prometheus: GET /metrics
    Prometheus->>Metrics: Collect Metrics
    Prometheus-->>Client: Return Prometheus Format
    Grafana->>Prometheus: Query pipeline_executions_total, pipeline_active_executions, etc.
    Prometheus-->>Grafana: Metric Data
    Grafana->>Grafana: Render Dashboard Panels
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues
🚥 Pre-merge checks: ✅ 3 passed checks
```python
# Record failure metrics
duration = time.time() - start_time
pipeline_executions_total.labels(pipeline_id=pipeline.id, status='failed').inc()
pipeline_failures_total.labels(pipeline_id=pipeline.id).inc()
pipeline_execution_duration_seconds.labels(pipeline_id=pipeline.id).observe(duration)
pipeline_active_executions.labels(pipeline_id=pipeline.id).dec()
```
🔴 pipeline_active_executions gauge never decremented if metrics recording in except block raises
The pipeline_active_executions gauge is incremented at backend/core/executor.py:48 (before the try block), but it is only decremented inside the try success path (line 79) or inside the except block (line 98). If any of the metrics calls preceding the .dec() in the except block (lines 95–97) raise an exception, the decrement on line 98 is never reached and the exception propagates out. This causes the gauge to permanently drift upward, reporting phantom active executions. The .dec() call should be placed in a finally block to guarantee it always executes regardless of what happens during metrics recording.
Prompt for agents

In backend/core/executor.py, refactor the execute() method (lines 47-99) so that the pipeline_active_executions gauge decrement is guaranteed to run. Move `pipeline_active_executions.labels(pipeline_id=pipeline.id).dec()` out of both the try and except blocks and into a finally block. The structure should be:

```python
pipeline_active_executions.labels(pipeline_id=pipeline.id).inc()
start_time = time.time()
try:
    ...  # success path, record success metrics but WITHOUT the .dec() call
except Exception as e:
    ...  # failure path, record failure metrics but WITHOUT the .dec() call
finally:
    pipeline_active_executions.labels(pipeline_id=pipeline.id).dec()
```

Remove the .dec() calls from both the try block (currently line 79) and the except block (currently line 98).
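The fix is easy to verify in isolation. The sketch below uses a minimal stand-in gauge (a hypothetical `StubGauge`, not the real `prometheus_client.Gauge`) so it runs without dependencies; it shows that the `finally` placement keeps the gauge balanced even when stage execution raises:

```python
import time

class StubGauge:
    """Minimal stand-in for a Prometheus gauge (illustration only)."""
    def __init__(self):
        self.value = 0
    def inc(self):
        self.value += 1
    def dec(self):
        self.value -= 1

pipeline_active_executions = StubGauge()

def execute(run_stages):
    """Sketch of the recommended structure: the decrement lives in
    `finally`, so it runs on success, on failure, and even if the
    metrics-recording calls themselves raise."""
    pipeline_active_executions.inc()
    start_time = time.perf_counter()
    try:
        run_stages()
        duration = time.perf_counter() - start_time
        # ... record success metrics with `duration` (no .dec() here)
    except Exception:
        # ... record failure metrics (no .dec() here)
        raise
    finally:
        pipeline_active_executions.dec()  # always runs
```

Whether `run_stages()` returns or raises, the gauge ends back at its starting value, so no phantom active executions accumulate.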
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/core/executor.py`:
- Around line 47-49: Replace uses of the wall-clock timer time.time() with a
monotonic high-resolution timer time.perf_counter() for all elapsed-time
measurements in backend/core/executor.py: change the start_time assignments
(e.g. the start_time set next to
pipeline_active_executions.labels(pipeline_id=pipeline.id).inc()) and any
subsequent elapsed = time.time() - start_time calculations to use
time.perf_counter() instead, and ensure imports/reference use the perf_counter
symbol; apply the same replacement at the other occurrences noted (the other
start_time/elapsed sites around lines 49, 76, 94, 111, 137, 153) so
histogram.observe calls use perf_counter-based durations.
In `@backend/main.py`:
- Around line 75-79: The /metrics endpoint (prometheus_metrics) is currently
exposed anonymously; when settings.AUTH_ENABLED is true, protect it by adding
the same authentication dependency (auth_dependencies) to the app.get decorator
or route registration and hide it from the public OpenAPI (set
include_in_schema=False) so it requires auth and is not in the public schema;
conditionalize the decorator or route registration on settings.AUTH_ENABLED (or
always attach auth_dependencies when enabled) to ensure prometheus_metrics is
only accessible to authenticated/internal callers.
In `@backend/monitoring/metrics.py`:
- Around line 39-44: The metric uses an unbounded label 'stage_name'
(stage_execution_duration_seconds) which will create infinite Prometheus series;
change the metric to use a bounded dimension such as 'stage_type' or
'stage_index' instead and update all call sites that set labels (references in
executor where stage_execution_duration_seconds.labels(...) is called) to
provide the chosen bounded value (e.g., Stage.type or a numeric stage index).
Leave Stage.name unchanged for logs/traces—emit the free-form name to
logging/tracing code but stop passing it into the metric labels. Ensure you
update the metric declaration (remove 'stage_name' label) and all uses in the
executor (the label-setting calls around the stage timing at the locations that
currently pass stage.name) to match the new label keys.
In `@docs/monitoring/grafana_dashboard.json`:
- Around line 92-94: The legend text is incorrect: the PromQL expression using
histogram_quantile(0.95, ...) produces a 95th percentile (p95), not a 95%
confidence interval; update the legendFormat value (currently "{{pipeline_id}}
(95% CI)") to a correct label such as "{{pipeline_id}} (p95)" or
"{{pipeline_id}} (95th percentile)" so the dashboard accurately reflects the
metric returned by histogram_quantile.
- Around line 24-28: The success-rate PromQL expression for the target using the
metric pipeline_executions_total should be changed to handle idle 5m windows:
replace the current expr string with the pattern that performs the division only
when the denominator sum(rate(pipeline_executions_total[5m])) > 0 and otherwise
returns 0 (use the "and on() (... > 0) or on() vector(0)" post-filter approach).
Locate the "targets" -> "expr" entry containing the current
sum(rate(pipeline_executions_total...)) division and update it to the
recommended guarded expression so the panel yields 0% instead of NaN/N/A when no
executions occur.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9ee64cdf-d109-47b9-bae1-2b24a89d032d
⛔ Files ignored due to path filters (4)
- backend/__pycache__/main.cpython-311.pyc is excluded by !**/*.pyc
- backend/api/__pycache__/schemas.cpython-311.pyc is excluded by !**/*.pyc
- backend/core/__pycache__/executor.cpython-311.pyc is excluded by !**/*.pyc
- backend/db/__pycache__/crud.cpython-311.pyc is excluded by !**/*.pyc
📒 Files selected for processing (4)
- backend/core/executor.py
- backend/main.py
- backend/monitoring/metrics.py
- docs/monitoring/grafana_dashboard.json
```python
# Track active pipeline start
pipeline_active_executions.labels(pipeline_id=pipeline.id).inc()
start_time = time.time()
```
🧩 Analysis chain
🌐 Web query: In Python, what clock is recommended for measuring elapsed durations: time.time() or time.perf_counter()/time.monotonic()?
💡 Result:
For measuring elapsed durations, prefer a monotonic clock:
- Use time.perf_counter() (or time.perf_counter_ns()) for timing/benchmarking elapsed time, because it's monotonic and uses the highest-resolution clock available. [1], [2]
- Use time.monotonic() (or time.monotonic_ns()) when you specifically want a monotonic elapsed-time clock (also immune to system clock changes), typically with slightly less "benchmark-focused" intent than perf_counter(). [1], [3]
- Avoid time.time() for durations because it's "wall clock" time and can jump forwards/backwards if the system time is adjusted (NTP, manual changes), which can break elapsed-time measurements. [1], [3]

Rule of thumb: durations → perf_counter() (most common); timestamps/"seconds since epoch" → time.time(). [1], [3]
Sources:
[1] PEP 418 (monotonic/perf_counter rationale and use) (peps.python.org)
[2] Python time module docs (perf_counter) (docs.python.org)
[3] Python time module docs (clock semantics; monotonic vs realtime) (docs.python.org)
Citations:
- 1: https://peps.python.org/pep-0418/?utm_source=openai
- 2: https://docs.python.org/3/library/time.html?utm_source=openai
- 3: https://docs.python.org/3/library/time.html?utm_source=openai
🏁 Script executed:

```shell
find . -name "executor.py" -path "*/core/*" | head -5
```
Repository: fuzziecoder/Flexi-Roaster

🏁 Script executed:

```shell
head -160 backend/core/executor.py | cat -n
```
Repository: fuzziecoder/Flexi-Roaster

🏁 Script executed:

```shell
tail -n +160 backend/core/executor.py | grep -n "time\." | head -20
```
Repository: fuzziecoder/Flexi-Roaster
Use a monotonic clock for elapsed-time metrics.
time.time() tracks wall time. NTP or manual clock changes can move it backward or forward and skew these histogram observations. time.perf_counter() is the recommended approach for measuring elapsed durations.
⏱️ Proposed refactor

```diff
- start_time = time.time()
+ start_time = time.perf_counter()
...
- duration = time.time() - start_time
+ duration = time.perf_counter() - start_time
...
- stage_start_time = time.time()
+ stage_start_time = time.perf_counter()
...
- stage_duration = time.time() - stage_start_time
+ stage_duration = time.perf_counter() - stage_start_time
```

Also applies to: 49, 76, 94, 111, 137, 153
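The pattern the review recommends can be seen with a tiny self-contained timer helper (a hypothetical `timed_call`, not code from this PR): `time.perf_counter()` is monotonic and high-resolution, so the measured duration can never come out negative the way a wall-clock delta can after an NTP step:

```python
import time

def timed_call(fn):
    """Return (result, elapsed_seconds), measured with a monotonic,
    high-resolution clock that is immune to system clock adjustments."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return result, elapsed

result, elapsed = timed_call(lambda: sum(range(1000)))
```

The elapsed value is what would be fed to `histogram.observe()`; because the clock is monotonic, the observation is always non-negative.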
```python
# Prometheus Metrics scraping endpoint
@app.get("/metrics", tags=["monitoring"])
async def prometheus_metrics():
    """Prometheus metrics endpoint"""
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```
Protect /metrics; it currently bypasses auth.
When settings.AUTH_ENABLED is true, only the routers on Lines 97-110 get auth_dependencies; this endpoint stays anonymous. That exposes operational data and any user-derived label values to anyone who can reach the service. Gate it behind the same auth policy or an internal-only ingress, and keep it out of the public schema.
🔒 Proposed fix

```diff
+# Define this before the route declarations.
+metrics_dependencies = [Depends(get_current_auth_context)] if settings.AUTH_ENABLED else []
+
-@app.get("/metrics", tags=["monitoring"])
+@app.get("/metrics", tags=["monitoring"], include_in_schema=False, dependencies=metrics_dependencies)
 async def prometheus_metrics():
     """Prometheus metrics endpoint"""
     return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```
```python
stage_execution_duration_seconds = Histogram(
    'stage_execution_duration_seconds',
    'Stage execution duration in seconds',
    ['pipeline_id', 'stage_name', 'status'],
    buckets=(0.1, 0.5, 1.0, 5.0, 15.0, 30.0, 60.0, 120.0, 300.0, float('inf'))
)
```
🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
# Verify that `Stage.name` is unconstrained and flows directly into the metric label.
rg -n -C2 'class Stage|name:\s*str|stage_execution_duration_seconds|stage_name\s*=\s*stage\.name' backend
```
Repository: fuzziecoder/Flexi-Roaster
Avoid unbounded stage_name labels.
Stage.name is defined as a free-form str in backend/models/pipeline.py and passed directly into the stage_execution_duration_seconds metric label at backend/core/executor.py lines 140 and 156. Every distinct stage name creates a new time-series in Prometheus, which will eventually exhaust memory and degrade query latency. Use a bounded dimension (e.g., stage type or index) instead, and keep the free-form name in logs or traces.
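The cardinality concern is easy to picture with a toy model (hypothetical stage data, with plain sets standing in for Prometheus time-series): every distinct label value becomes its own series, so a free-form name grows without bound while a type enum stays fixed:

```python
from collections import namedtuple

# Hypothetical stages: free-form names, but only a few stage types.
Stage = namedtuple("Stage", ["name", "type"])
stages = [
    Stage("build-2024-01-15-run-8841", "build"),
    Stage("build-2024-01-16-run-8842", "build"),
    Stage("deploy-prod-us-east-1", "deploy"),
]

# Each distinct label value creates a separate time-series in Prometheus;
# sets stand in for the label-value universe here.
series_by_name = {s.name for s in stages}  # one entry per run: unbounded
series_by_type = {s.type for s in stages}  # bounded by the type enum
```

With `stage_name` as a label, the series count tracks the number of runs; with `stage_type` it is capped by the small, fixed set of types, while the free-form name stays available for logs and traces.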
```json
"targets": [
  {
    "expr": "sum(rate(pipeline_executions_total{status=\"success\"}[5m])) / sum(rate(pipeline_executions_total[5m])) * 100",
    "refId": "A"
  }
```
🧩 Analysis chain
🌐 Web query: In PromQL, how should a success / total * 100 query be written so Grafana shows 0 instead of NaN/N/A when the denominator is zero or the series is absent?
💡 Result:
Use a filter to drop the divide-by-zero series, then fall back to 0 when the result is empty:

```promql
(
  100 * success / total
)
and on(<your_grouping_labels>)
(
  total > 0
)
or on() vector(0)
```

- `total > 0` (without `bool`) removes series where the denominator is 0, so you don't get NaN.
- `... or on() vector(0)` makes Grafana show a 0 when the whole expression returns no series (missing/absent data). [1]

If you're aggregating (the common case), apply the same idea to the aggregated vectors:

```promql
(
  100 *
  sum by (job) (success)
  /
  sum by (job) (total)
)
and on(job)
(
  sum by (job) (total) > 0
)
or on() vector(0)
```

Sources: [1] https://grafana.com/docs/grafana/latest/alerting/guides/missing-data/ (Prometheus example: your_metric_query OR on() vector(0)), [2] https://stackoverflow.com/questions/47056557/how-to-gracefully-avoid-divide-by-zero-in-prometheus
🏁 Script executed:

```shell
find . -name "grafana_dashboard.json" -type f
```
Repository: fuzziecoder/Flexi-Roaster

🏁 Script executed:

```shell
# Once we find the file, let's examine it
cat -n docs/monitoring/grafana_dashboard.json | head -100
```
Repository: fuzziecoder/Flexi-Roaster
Handle idle windows in the success-rate query.
When no executions occur in a 5-minute window, both sum(rate(...)) expressions return empty series. The denominator becomes 0 or absent, causing the division to produce NaN/N/A instead of 0%.
The proposed fix using or vector(0) and clamp_min() will work. However, PromQL documentation recommends a cleaner pattern that filters before calculating:
📊 Alternative approach
(
100 * sum(rate(pipeline_executions_total{status="success"}[5m]))
/
sum(rate(pipeline_executions_total[5m]))
)
and on() (sum(rate(pipeline_executions_total[5m])) > 0)
or on() vector(0)
This filters out the division when the denominator is 0, then falls back to 0 when the entire expression returns no series.
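The guard's intent can be mirrored in plain Python (a hypothetical helper, not how Prometheus actually evaluates the expression): divide only when the denominator is positive, otherwise report 0:

```python
def success_rate_percent(success_rate_5m: float, total_rate_5m: float) -> float:
    """Mirror of the guarded PromQL: fall back to 0.0 instead of
    NaN/N/A when nothing executed in the window."""
    if total_rate_5m > 0:
        return 100.0 * success_rate_5m / total_rate_5m
    return 0.0
```

For an idle window, `success_rate_percent(0.0, 0.0)` yields `0.0`, where the unguarded division would be undefined.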
```json
"expr": "histogram_quantile(0.95, sum(rate(pipeline_execution_duration_seconds_bucket[5m])) by (le, pipeline_id))",
"legendFormat": "{{pipeline_id}} (95% CI)",
"refId": "A"
```
Rename the legend; this is a percentile, not a confidence interval.
histogram_quantile(0.95, ...) returns a p95 estimate. Labeling it 95% CI is statistically incorrect and will mislead dashboard readers.
📝 Proposed fix

```diff
- "legendFormat": "{{pipeline_id}} (95% CI)",
+ "legendFormat": "{{pipeline_id}} p95",
```
Verify each finding against the current code and only fix it if needed.
In `@docs/monitoring/grafana_dashboard.json` around lines 92 - 94, The legend text
is incorrect: the PromQL expression using histogram_quantile(0.95, ...) produces
a 95th percentile (p95), not a 95% confidence interval; update the legendFormat
value (currently "{{pipeline_id}} (95% CI)") to a correct label such as
"{{pipeline_id}} (p95)" or "{{pipeline_id}} (95th percentile)" so the dashboard
accurately reflects the metric returned by histogram_quantile.
```python
# Record success metrics
duration = time.time() - start_time
pipeline_executions_total.labels(pipeline_id=pipeline.id, status='success').inc()
pipeline_execution_duration_seconds.labels(pipeline_id=pipeline.id).observe(duration)
pipeline_active_executions.labels(pipeline_id=pipeline.id).dec()
```
🟡 Success-path metrics recording inside try block can overwrite COMPLETED status to FAILED
The Prometheus metrics recording for the success path (lines 76-79) is placed inside the try block, after the execution status is already set to ExecutionStatus.COMPLETED at line 67. If any of the metrics calls (pipeline_executions_total.labels(...).inc(), .observe(), or .dec()) raise an exception, the except Exception block at line 81 will catch it and overwrite execution.status to ExecutionStatus.FAILED at line 83 — even though all pipeline stages completed successfully. The returned Execution object would report a failed status, a misleading error message from the Prometheus client, and failure metrics would be emitted to Prometheus instead of success metrics. The metrics recording should be moved after the try/except, or into a separate try block, so it cannot interfere with the execution status determination.
Prompt for agents
In backend/core/executor.py, the metrics recording code for both success and failure paths (lines 75-79 and 93-98) should be restructured so that a failure in metrics recording cannot change the execution status. The recommended approach is to: 1) Move the pipeline_active_executions.dec() call into a finally block to guarantee gauge cleanup regardless of what happens. 2) Move the success metrics recording (lines 76-79) out of the try block and into a separate try/except or into a finally block with a status check. For example, after the main try/except, add a finally block that always decrements the gauge, and use a separate try/except for the Prometheus calls that cannot affect execution.status. The key invariant is: once execution.status is set to COMPLETED at line 67, no subsequent code path should overwrite it to FAILED unless the pipeline stages themselves failed.
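The invariant described above (metrics recording must never flip a COMPLETED status to FAILED) can be sketched with a small wrapper. This is an illustrative stand-in, not the actual executor code; `record_metrics_safely` and the simplified `Execution` class are hypothetical:

```python
import logging

def record_metrics_safely(record):
    """Run metrics recording in its own try/except so a Prometheus
    client error cannot change the execution outcome."""
    try:
        record()
    except Exception:
        logging.exception("metrics recording failed")  # best-effort only

class Execution:
    def __init__(self):
        self.status = "PENDING"

def execute(run_stages, record_success, record_failure):
    execution = Execution()
    try:
        run_stages()
        execution.status = "COMPLETED"
        record_metrics_safely(record_success)  # cannot overwrite COMPLETED
    except Exception:
        execution.status = "FAILED"
        record_metrics_safely(record_failure)
    return execution
```

Only a failure inside `run_stages()` itself can mark the execution FAILED; a raising metrics call is logged and swallowed instead of being caught by the outer handler.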
What:
Implemented comprehensive Prometheus-style metrics collection for pipeline and stage performance monitoring, and exported a pre-configured Grafana Dashboard JSON for visualization.

Why:
Resolves the issue requesting real-time visibility into pipeline throughput, success/failure rates, active workloads, and execution duration trends.

How:
- `prometheus_client` Counters (total runs, failures), Histograms (duration distributions), and Gauges (active threads).
- `GET /metrics` exposed in backend/main.py.

Testing:
- `GET /metrics` text payload formatting verified in the browser.

Addressing Issue #30
Summary by CodeRabbit
Release Notes