Implement Prometheus metrics and ELK/Grafana observability stack#90
Conversation
| GitGuardian id | GitGuardian status | Secret | Commit | Filename | |
|---|---|---|---|---|---|
| 27568531 | Triggered | Username Password | 705f784 | backend/tests/test_security.py | View secret |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secret safely. Learn the best practices here.
- Revoke and rotate this secret.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:
- following these best practices for managing and storing secrets, including API keys and other credentials
- installing secret detection on pre-commit to catch secrets before they leave your machine and ease remediation
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
21b0575 into codex/fix-remaining-issues-and-raise-pr
Important: Review skipped. Auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI.
```python
def observe_http_request(self, method: str, path: str, status_code: int, duration_seconds: float) -> None:
    status = str(status_code)
    self.http_requests_total.labels(method=method, path=path, status=status).inc()
    self.http_request_latency_seconds.labels(method=method, path=path, status=status).observe(duration_seconds)
```
🔴 Unbounded Prometheus label cardinality from raw URL paths causes memory exhaustion
The HTTP request metrics use the raw request.url.path (which includes dynamic path segments like execution IDs) as a Prometheus label. Every unique path such as /api/executions/exec-abc123-... creates a new time series in the Prometheus client registry.
Root Cause and Impact
In backend/api/middleware/logging_middleware.py:112-113, the raw path is captured:

```python
path = request.url.path
```

This is then passed directly to observe_http_request at line 161, which uses it as a Prometheus label in backend/observability/metrics.py:89-90:

```python
self.http_requests_total.labels(method=method, path=path, status=status).inc()
self.http_request_latency_seconds.labels(method=method, path=path, status=status).observe(duration_seconds)
```

Routes like /api/executions/{execution_id} and /api/executions/{execution_id}/logs produce a new unique label combination for every execution. This is a well-documented Prometheus anti-pattern that causes:
- Monotonically increasing memory usage in the application process (the `prometheus_client` registry never forgets label sets)
- Prometheus server memory/disk exhaustion from scraping ever-growing series
- Degraded query performance in Grafana dashboards
Expected: Paths should be normalized to their route templates (e.g., /api/executions/{execution_id}) before being used as metric labels to keep cardinality bounded.
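To see why the template matters, here is a small stdlib-only sketch; the dict stands in for the client registry, which likewise keeps one entry per distinct label set for the lifetime of the process:

```python
# Dict analogy for the prometheus_client registry: one entry per label set,
# never evicted while the process lives.
series: dict[tuple, float] = {}

def inc(method: str, path: str, status: str) -> None:
    key = (method, path, status)
    series[key] = series.get(key, 0.0) + 1.0

# Raw paths: every execution ID mints a brand-new series.
for i in range(10_000):
    inc("GET", f"/api/executions/exec-{i}", "200")
print(len(series))  # 10000

series.clear()

# Route template: the same traffic collapses into one bounded series.
for i in range(10_000):
    inc("GET", "/api/executions/{execution_id}", "200")
print(len(series))  # 1
```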
Prompt for agents
In backend/api/middleware/logging_middleware.py, before passing `path` to `observability_metrics.observe_http_request()`, normalize the path to its route template to prevent unbounded Prometheus label cardinality. One approach: use FastAPI's request.scope to get the matched route pattern. For example, around line 112 after `path = request.url.path`, add a helper that extracts the route template:
```python
route = request.scope.get("route")
metric_path = route.path if route and hasattr(route, "path") else path
```
Then pass `metric_path` instead of `path` to `observe_http_request()` at lines 159-164 and 176-180. This ensures paths like /api/executions/exec-abc123 are recorded as /api/executions/{execution_id}, keeping label cardinality bounded.
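A framework-free sketch of the suggested normalization (the `route` key is what Starlette/FastAPI places in the ASGI scope after routing; the helper name and the fake scope are illustrative):

```python
from types import SimpleNamespace

def metric_path(scope: dict, raw_path: str) -> str:
    """Prefer the matched route template over the raw URL path for metric labels."""
    # Starlette sets "route" on the scope after routing; it may be absent
    # (e.g. 404s or requests handled before routing), so fall back to the raw path.
    route = scope.get("route")
    template = getattr(route, "path", None)
    return template if isinstance(template, str) else raw_path

# Matched route: the template keeps cardinality bounded.
scope = {"route": SimpleNamespace(path="/api/executions/{execution_id}")}
print(metric_path(scope, "/api/executions/exec-abc123"))  # /api/executions/{execution_id}

# No matched route (e.g. 404): fall back to the raw path.
print(metric_path({}, "/unknown"))  # /unknown
```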
```python
load1, _, _ = os.getloadavg()
cpu_usage = max(min(load1 * 100, 100.0), 0.0)
```
🟡 CPU usage metric derived from load average is not a valid CPU percentage
Both backend/api/routes/metrics.py:75-76 and backend/observability/metrics.py:139-141 compute CPU usage as load1 * 100, treating the 1-minute load average from os.getloadavg() as if it were a CPU utilization fraction.
Detailed Explanation
The 1-minute load average represents the average number of processes in the system's run queue — it is not a fraction of CPU capacity. On a multi-core system a load of 4.0 is normal (not 400% CPU). Conversely, a load of 0.5 on an otherwise idle 16-core machine would be reported as 50% CPU usage.
In backend/api/routes/metrics.py:75-76:

```python
load1, _, _ = os.getloadavg()
cpu_usage = max(min(load1 * 100, 100.0), 0.0)
```

And the same logic in backend/observability/metrics.py:139-141:

```python
load1, _, _ = os.getloadavg()
cpu_guess = max(load1 * 100.0, 0.0)
self.process_cpu_percent.set(cpu_guess)
```

Note that the Prometheus gauge version at metrics.py:140 doesn't even apply min(..., 100.0), so the gauge can report values above 100 (e.g., load of 2.0 → 200.0), which violates the "percent" semantic of the metric name flexiroaster_process_cpu_percent.
Impact: CPU usage metrics will be misleading in dashboards and SLA tracking. On single-core systems with low load this might look plausible, masking the fact that the numbers are semantically wrong on multi-core systems.
Prompt for agents
In both backend/api/routes/metrics.py (lines 74-78) and backend/observability/metrics.py (lines 138-141), the load average is incorrectly used as CPU percentage. To fix:
1. Divide load1 by the number of CPUs to get a utilization ratio: use os.cpu_count() or multiprocessing.cpu_count().
2. Apply: cpu_usage = max(min((load1 / os.cpu_count()) * 100.0, 100.0), 0.0)
This converts the load average into a rough per-CPU utilization percentage that makes sense on multi-core systems. Alternatively, if psutil is available, prefer psutil.cpu_percent() which gives the actual CPU utilization.
Motivation
Description
- Added `backend/observability` with `metrics.py` implementing Prometheus metrics (counters, histograms, gauges) and helpers exposed via `observability_metrics`, and a small `logging.py` TCP JSON handler for Logstash integration.
- Added `GET /metrics` that returns the Prometheus exposition format via `observability_metrics.prometheus_payload()` and instrumented the existing `/api/metrics` business endpoint to use real resource snapshots.
- Updated the request-logging middleware (`backend/api/middleware/logging_middleware.py`) to record HTTP request counts and latencies and to emit metrics on exception paths, and instrumented the execution lifecycle (`backend/api/routes/executions.py`) to record pipeline execution outcomes, latency, failure-rate gauge updates, active execution counts, and SLA breach counting.
- Added settings to `backend/config.py` for observability and log shipping (`ENABLE_PROMETHEUS_METRICS`, `PIPELINE_SLA_TARGET_SECONDS`, `ENABLE_LOGSTASH_LOGGING`, `LOGSTASH_HOST`, `LOGSTASH_PORT`) and added `prometheus-client` to `backend/requirements.txt`.
- Added `docker-compose.yml` wiring to run Prometheus, Grafana, Elasticsearch and Logstash alongside the backend, and documented usage and environment variables in `backend/README.md`.
Testing
- Ran `python -m compileall backend` to ensure modules compile successfully (succeeded).
- Ran `PYTHONPATH=. pytest backend/tests/test_security.py -q`, which completed successfully (4 passed).
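The `logging.py` TCP JSON handler described above could look roughly like this sketch; the field names and port are illustrative, and the format targets Logstash's `tcp` input with the `json_lines` codec (newline-delimited JSON):

```python
import json
import logging
from logging.handlers import SocketHandler

class LogstashTCPHandler(SocketHandler):
    """Ship log records to Logstash as newline-delimited JSON over TCP (sketch)."""

    def makePickle(self, record: logging.LogRecord) -> bytes:
        # SocketHandler pickles records by default; Logstash's json_lines codec
        # expects one JSON document per line instead.
        doc = {
            "@timestamp": record.created,
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return (json.dumps(doc) + "\n").encode("utf-8")

# SocketHandler connects lazily on first emit, so the handler can be built offline:
handler = LogstashTCPHandler("localhost", 5044)
record = logging.LogRecord("app", logging.INFO, "app.py", 1, "hello %s", ("world",), None)
print(handler.makePickle(record))
```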