
Implement Prometheus metrics and ELK/Grafana observability stack #90

Merged
fuzziecoder merged 2 commits into codex/fix-remaining-issues-and-raise-pr from codex/implement-monitoring-and-observability-stack on Feb 25, 2026

Conversation

@fuzziecoder (Owner) commented on Feb 25, 2026

Motivation

  • Provide a production-ready observability baseline (Prometheus + Grafana + Elasticsearch/Logstash) so infrastructure and pipeline telemetry are available out of the box.
  • Capture key pipeline telemetry (latency, failure rate, active execution counts, SLA breaches) for reliability and SLA tracking.
  • Enable centralized JSON log shipping to Elasticsearch via Logstash so execution logs and application logs can be searched and correlated.

Description

  • Added a new observability module (backend/observability): metrics.py implements Prometheus metrics (counters, histograms, gauges) with helpers exposed via observability_metrics, and logging.py provides a small TCP JSON handler for Logstash integration.
  • Exposed a Prometheus scrape endpoint at GET /metrics that returns the Prometheus exposition format via observability_metrics.prometheus_payload(), and instrumented the existing /api/metrics business endpoint to use real resource snapshots.
  • Instrumented the request middleware (backend/api/middleware/logging_middleware.py) to record HTTP request counts and latencies, including on exception paths, and the execution lifecycle (backend/api/routes/executions.py) to record pipeline execution outcomes, latency, failure-rate gauge updates, active execution counts, and SLA breaches.
  • Added configuration knobs in backend/config.py for observability and log shipping (ENABLE_PROMETHEUS_METRICS, PIPELINE_SLA_TARGET_SECONDS, ENABLE_LOGSTASH_LOGGING, LOGSTASH_HOST, LOGSTASH_PORT) and added prometheus-client to backend/requirements.txt.
  • Added a local docker-compose.yml that runs Prometheus, Grafana, Elasticsearch, and Logstash alongside the backend, and documented usage and environment variables in backend/README.md.
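The Logstash-bound TCP JSON handler mentioned above can be sketched with the standard library alone; the class name, JSON field names, and reconnect behavior below are illustrative assumptions, not the PR's actual code.

```python
import json
import logging
import socket


class LogstashTCPHandler(logging.Handler):
    """Hypothetical sketch of a logging handler that ships JSON lines
    to Logstash over TCP (field names are assumptions)."""

    def __init__(self, host: str, port: int) -> None:
        super().__init__()
        self.host, self.port = host, port
        self._sock = None

    def serialize(self, record: logging.LogRecord) -> bytes:
        doc = {
            "@timestamp": record.created,
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Logstash's json_lines codec expects one JSON document per line.
        return (json.dumps(doc) + "\n").encode("utf-8")

    def emit(self, record: logging.LogRecord) -> None:
        try:
            if self._sock is None:
                self._sock = socket.create_connection((self.host, self.port), timeout=2)
            self._sock.sendall(self.serialize(record))
        except OSError:
            # Drop the socket and retry on the next record rather than crash the app.
            self._sock = None
```

With ENABLE_LOGSTASH_LOGGING set, such a handler would be attached pointing at LOGSTASH_HOST:LOGSTASH_PORT, and a Logstash TCP input with the json_lines codec would forward each line to Elasticsearch.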

Testing

  • Ran python -m compileall backend to ensure modules compile successfully (succeeded).
  • Ran unit tests with PYTHONPATH=. pytest backend/tests/test_security.py -q which completed successfully (4 passed).


@vercel vercel bot commented on Feb 25, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: flexi-roaster · Deployment: Ready · Actions: Preview, Comment · Updated (UTC): Feb 25, 2026 1:24pm

@gitguardian gitguardian bot commented on Feb 25, 2026

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request
GitGuardian id: 27568531 · Status: Triggered · Secret: Username Password · Commit: 705f784 · Filename: backend/tests/test_security.py
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely, following secret-management best practices.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


@fuzziecoder fuzziecoder merged commit 21b0575 into codex/fix-remaining-issues-and-raise-pr Feb 25, 2026
3 of 6 checks passed
@coderabbitai coderabbitai bot commented on Feb 25, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting reviews.review_status to false in the CodeRabbit configuration file.



@devin-ai-integration devin-ai-integration bot left a comment

Devin Review found 2 potential issues.

View 6 additional findings in Devin Review.


Comment on lines +87 to +90:

    def observe_http_request(self, method: str, path: str, status_code: int, duration_seconds: float) -> None:
        status = str(status_code)
        self.http_requests_total.labels(method=method, path=path, status=status).inc()
        self.http_request_latency_seconds.labels(method=method, path=path, status=status).observe(duration_seconds)

🔴 Unbounded Prometheus label cardinality from raw URL paths causes memory exhaustion

The HTTP request metrics use the raw request.url.path (which includes dynamic path segments like execution IDs) as a Prometheus label. Every unique path such as /api/executions/exec-abc123-... creates a new time series in the Prometheus client registry.

Root Cause and Impact

In backend/api/middleware/logging_middleware.py:112-113, the raw path is captured:

    path = request.url.path

This is then passed directly to observe_http_request at line 161, which uses it as a Prometheus label in backend/observability/metrics.py:89-90:

    self.http_requests_total.labels(method=method, path=path, status=status).inc()
    self.http_request_latency_seconds.labels(method=method, path=path, status=status).observe(duration_seconds)

Routes like /api/executions/{execution_id} and /api/executions/{execution_id}/logs produce a new unique label combination for every execution. This is a well-documented Prometheus anti-pattern that causes:

  • Monotonically increasing memory usage in the application process (the prometheus_client registry never forgets label sets)
  • Prometheus server memory/disk exhaustion from scraping ever-growing series
  • Degraded query performance in Grafana dashboards

Expected: Paths should be normalized to their route templates (e.g., /api/executions/{execution_id}) before being used as metric labels to keep cardinality bounded.

Prompt for agents
In backend/api/middleware/logging_middleware.py, before passing `path` to `observability_metrics.observe_http_request()`, normalize the path to its route template to prevent unbounded Prometheus label cardinality. One approach: use FastAPI's request.scope to get the matched route pattern. For example, around line 112 after `path = request.url.path`, add a helper that extracts the route template:

    route = request.scope.get("route")
    metric_path = route.path if route and hasattr(route, "path") else path

Then pass `metric_path` instead of `path` to `observe_http_request()` at lines 159-164 and 176-180. This ensures paths like /api/executions/exec-abc123 are recorded as /api/executions/{execution_id}, keeping label cardinality bounded.


Comment on lines +75 to +76:

    load1, _, _ = os.getloadavg()
    cpu_usage = max(min(load1 * 100, 100.0), 0.0)

🟡 CPU usage metric derived from load average is not a valid CPU percentage

Both backend/api/routes/metrics.py:75-76 and backend/observability/metrics.py:139-141 compute CPU usage as load1 * 100, treating the 1-minute load average from os.getloadavg() as if it were a CPU utilization fraction.

Detailed Explanation

The 1-minute load average represents the average number of processes in the system's run queue — it is not a fraction of CPU capacity. On a multi-core system a load of 4.0 is normal (not 400% CPU). Conversely, a load of 0.5 on an otherwise idle 16-core machine would be reported as 50% CPU usage.

In backend/api/routes/metrics.py:75-76:

    load1, _, _ = os.getloadavg()
    cpu_usage = max(min(load1 * 100, 100.0), 0.0)

And the same logic in backend/observability/metrics.py:139-141:

    load1, _, _ = os.getloadavg()
    cpu_guess = max(load1 * 100.0, 0.0)
    self.process_cpu_percent.set(cpu_guess)

Note that the Prometheus gauge version at metrics.py:140 doesn't even apply min(..., 100.0), so the gauge can report values above 100 (e.g., load of 2.0 → 200.0), which violates the "percent" semantic of the metric name flexiroaster_process_cpu_percent.

Impact: CPU usage metrics will be misleading in dashboards and SLA tracking. On single-core systems with low load this might look plausible, masking the fact that the numbers are semantically wrong on multi-core systems.

Prompt for agents
In both backend/api/routes/metrics.py (lines 74-78) and backend/observability/metrics.py (lines 138-141), the load average is incorrectly used as CPU percentage. To fix:

1. Divide load1 by the number of CPUs to get a utilization ratio: use os.cpu_count() or multiprocessing.cpu_count().
2. Apply: cpu_usage = max(min((load1 / os.cpu_count()) * 100.0, 100.0), 0.0)

This converts the load average into a rough per-CPU utilization percentage that makes sense on multi-core systems. Alternatively, if psutil is available, prefer psutil.cpu_percent() which gives the actual CPU utilization.
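The normalization step can be sketched as a pure function (hypothetical name, not the PR's code); dividing by the core count turns the load average into a rough per-CPU percentage, clamped to the 0-100 range the metric name promises:

```python
import os


def cpu_percent_from_load(load1: float, cpus: int) -> float:
    """Rough CPU utilization from a 1-minute load average: a load of 4.0
    on 8 cores reads as 50%, and values are clamped to [0, 100]. Prefer
    psutil.cpu_percent() for real utilization when psutil is available."""
    return max(min((load1 / max(cpus, 1)) * 100.0, 100.0), 0.0)


# Live usage (Unix only):
# cpu_usage = cpu_percent_from_load(os.getloadavg()[0], os.cpu_count() or 1)
```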

