
Implement Prometheus metrics and ELK/Grafana observability stack #90

Merged
fuzziecoder merged 2 commits into codex/fix-remaining-issues-and-raise-pr from codex/implement-monitoring-and-observability-stack on Feb 25, 2026

Conversation

@fuzziecoder (Owner) commented on Feb 25, 2026

Motivation

  • Provide a production-ready observability baseline (Prometheus + Grafana + Elasticsearch/Logstash) so infrastructure and pipeline telemetry are available out of the box.
  • Capture key pipeline telemetry (latency, failure rate, active execution counts, SLA breaches) for reliability and SLA tracking.
  • Enable centralized JSON log shipping to Elasticsearch via Logstash so execution logs and application logs can be searched and correlated.

Description

  • Added a new observability module (backend/observability): metrics.py implements Prometheus metrics (counters, histograms, gauges) with helpers exposed via observability_metrics, and logging.py provides a small TCP JSON handler for Logstash integration.
  • Exposed a Prometheus scrape endpoint at GET /metrics that returns the Prometheus exposition format via observability_metrics.prometheus_payload(), and instrumented the existing /api/metrics business endpoint to use real resource snapshots.
  • Instrumented the request middleware (backend/api/middleware/logging_middleware.py) to record HTTP request counts and latencies, including on exception paths, and the execution lifecycle (backend/api/routes/executions.py) to record pipeline execution outcomes, latency, failure-rate gauge updates, active execution counts, and SLA breaches.
  • Added configuration knobs in backend/config.py for observability and log shipping (ENABLE_PROMETHEUS_METRICS, PIPELINE_SLA_TARGET_SECONDS, ENABLE_LOGSTASH_LOGGING, LOGSTASH_HOST, LOGSTASH_PORT) and added prometheus-client to backend/requirements.txt.
  • Added a local docker-compose.yml that runs Prometheus, Grafana, Elasticsearch, and Logstash alongside the backend, and documented usage and environment variables in backend/README.md.
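The Logstash-bound TCP JSON handler mentioned above can be sketched with the standard library alone; the class name, JSON field names, and reconnect behavior below are illustrative assumptions, not the PR's actual code.

```python
import json
import logging
import socket


class LogstashTCPHandler(logging.Handler):
    """Hypothetical sketch of a logging handler that ships JSON lines
    to Logstash over TCP (field names are assumptions)."""

    def __init__(self, host: str, port: int) -> None:
        super().__init__()
        self.host, self.port = host, port
        self._sock = None

    def serialize(self, record: logging.LogRecord) -> bytes:
        doc = {
            "@timestamp": record.created,
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Logstash's json_lines codec expects one JSON document per line.
        return (json.dumps(doc) + "\n").encode("utf-8")

    def emit(self, record: logging.LogRecord) -> None:
        try:
            if self._sock is None:
                self._sock = socket.create_connection((self.host, self.port), timeout=2)
            self._sock.sendall(self.serialize(record))
        except OSError:
            # Drop the socket and retry on the next record rather than crash the app.
            self._sock = None
```

With ENABLE_LOGSTASH_LOGGING set, such a handler would be attached pointing at LOGSTASH_HOST:LOGSTASH_PORT, and a Logstash TCP input with the json_lines codec would forward each line to Elasticsearch.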

Testing

  • Ran python -m compileall backend to ensure modules compile successfully (succeeded).
  • Ran unit tests with PYTHONPATH=. pytest backend/tests/test_security.py -q which completed successfully (4 passed).


@vercel vercel bot commented on Feb 25, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: flexi-roaster · Deployment: Ready · Actions: Preview, Comment · Updated (UTC): Feb 25, 2026 1:24pm

@gitguardian gitguardian bot commented on Feb 25, 2026

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request
GitGuardian id: 27568531 · Status: Triggered · Secret: Username Password · Commit: 705f784 · Filename: backend/tests/test_security.py
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely, following secret-management best practices.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


@fuzziecoder fuzziecoder merged commit 21b0575 into codex/fix-remaining-issues-and-raise-pr Feb 25, 2026
3 of 6 checks passed
@coderabbitai coderabbitai bot commented on Feb 25, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting reviews.review_status to false in the CodeRabbit configuration file.



@devin-ai-integration devin-ai-integration bot left a comment

Devin Review found 2 potential issues.

View 6 additional findings in Devin Review.


Comment on lines +87 to +90:

    def observe_http_request(self, method: str, path: str, status_code: int, duration_seconds: float) -> None:
        status = str(status_code)
        self.http_requests_total.labels(method=method, path=path, status=status).inc()
        self.http_request_latency_seconds.labels(method=method, path=path, status=status).observe(duration_seconds)

🔴 Unbounded Prometheus label cardinality from raw URL paths causes memory exhaustion

The HTTP request metrics use the raw request.url.path (which includes dynamic path segments like execution IDs) as a Prometheus label. Every unique path such as /api/executions/exec-abc123-... creates a new time series in the Prometheus client registry.

Root Cause and Impact

In backend/api/middleware/logging_middleware.py:112-113, the raw path is captured:

    path = request.url.path

This is then passed directly to observe_http_request at line 161, which uses it as a Prometheus label in backend/observability/metrics.py:89-90:

    self.http_requests_total.labels(method=method, path=path, status=status).inc()
    self.http_request_latency_seconds.labels(method=method, path=path, status=status).observe(duration_seconds)

Routes like /api/executions/{execution_id} and /api/executions/{execution_id}/logs produce a new unique label combination for every execution. This is a well-documented Prometheus anti-pattern that causes:

  • Monotonically increasing memory usage in the application process (the prometheus_client registry never forgets label sets)
  • Prometheus server memory/disk exhaustion from scraping ever-growing series
  • Degraded query performance in Grafana dashboards

Expected: Paths should be normalized to their route templates (e.g., /api/executions/{execution_id}) before being used as metric labels to keep cardinality bounded.

Prompt for agents
In backend/api/middleware/logging_middleware.py, before passing `path` to `observability_metrics.observe_http_request()`, normalize the path to its route template to prevent unbounded Prometheus label cardinality. One approach: use FastAPI's request.scope to get the matched route pattern. For example, around line 112 after `path = request.url.path`, add a helper that extracts the route template:

    route = request.scope.get("route")
    metric_path = route.path if route and hasattr(route, "path") else path

Then pass `metric_path` instead of `path` to `observe_http_request()` at lines 159-164 and 176-180. This ensures paths like /api/executions/exec-abc123 are recorded as /api/executions/{execution_id}, keeping label cardinality bounded.


Comment on lines +75 to +76:

    load1, _, _ = os.getloadavg()
    cpu_usage = max(min(load1 * 100, 100.0), 0.0)

🟡 CPU usage metric derived from load average is not a valid CPU percentage

Both backend/api/routes/metrics.py:75-76 and backend/observability/metrics.py:139-141 compute CPU usage as load1 * 100, treating the 1-minute load average from os.getloadavg() as if it were a CPU utilization fraction.

Detailed Explanation

The 1-minute load average represents the average number of processes in the system's run queue — it is not a fraction of CPU capacity. On a multi-core system a load of 4.0 is normal (not 400% CPU). Conversely, a load of 0.5 on an otherwise idle 16-core machine would be reported as 50% CPU usage.

In backend/api/routes/metrics.py:75-76:

    load1, _, _ = os.getloadavg()
    cpu_usage = max(min(load1 * 100, 100.0), 0.0)

And the same logic in backend/observability/metrics.py:139-141:

    load1, _, _ = os.getloadavg()
    cpu_guess = max(load1 * 100.0, 0.0)
    self.process_cpu_percent.set(cpu_guess)

Note that the Prometheus gauge version at metrics.py:140 doesn't even apply min(..., 100.0), so the gauge can report values above 100 (e.g., load of 2.0 → 200.0), which violates the "percent" semantic of the metric name flexiroaster_process_cpu_percent.

Impact: CPU usage metrics will be misleading in dashboards and SLA tracking. On single-core systems with low load this might look plausible, masking the fact that the numbers are semantically wrong on multi-core systems.

Prompt for agents
In both backend/api/routes/metrics.py (lines 74-78) and backend/observability/metrics.py (lines 138-141), the load average is incorrectly used as CPU percentage. To fix:

1. Divide load1 by the number of CPUs to get a utilization ratio: use os.cpu_count() or multiprocessing.cpu_count().
2. Apply: cpu_usage = max(min((load1 / os.cpu_count()) * 100.0, 100.0), 0.0)

This converts the load average into a rough per-CPU utilization percentage that makes sense on multi-core systems. Alternatively, if psutil is available, prefer psutil.cpu_percent() which gives the actual CPU utilization.
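The normalization step can be sketched as a pure function (hypothetical name, not the PR's code); dividing by the core count turns the load average into a rough per-CPU percentage, clamped to the 0-100 range the metric name promises:

```python
import os


def cpu_percent_from_load(load1: float, cpus: int) -> float:
    """Rough CPU utilization from a 1-minute load average: a load of 4.0
    on 8 cores reads as 50%, and values are clamped to [0, 100]. Prefer
    psutil.cpu_percent() for real utilization when psutil is available."""
    return max(min((load1 / max(cpus, 1)) * 100.0, 100.0), 0.0)


# Live usage (Unix only):
# cpu_usage = cpu_percent_from_load(os.getloadavg()[0], os.cpu_count() or 1)
```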

