
Feat/health checks#275

Open
madsysharma wants to merge 8 commits into
imDarshanGK:mainfrom
madsysharma:feat/health-checks

Conversation

@madsysharma

Description

Adds operational endpoints requested in #219: a Prometheus metrics scrape target and Kubernetes-style liveness/readiness probes.

New endpoints

| Endpoint | Purpose | Behaviour |
|---|---|---|
| GET /healthz/live | Liveness probe | Returns 200 {"status":"ok"} while the process can answer HTTP. Does not check external dependencies: Kubernetes restarts the container when this probe fails, so it must never depend on recoverable backends. |
| GET /healthz/ready | Readiness probe | Runs SELECT 1 against the SQLAlchemy engine. Returns 200 when all checks pass, 503 with a per-check breakdown when any fail. On failure Kubernetes stops routing Service traffic to the pod but does not restart it. |
| GET /metrics | Prometheus exposition | Counters/histograms/gauges for request totals, latency, in-flight count, and unhandled exceptions, plus an app_info gauge. |

Metric families exposed

| Metric | Type | Labels | Description |
|---|---|---|---|
| qyverixai_http_requests_total | Counter | method, endpoint, status_code | Total number of requests processed. |
| qyverixai_http_request_duration_seconds | Histogram | method, endpoint | Request latency. Buckets: 5 ms to 30 s. |
| qyverixai_http_requests_in_progress | Gauge | method, endpoint | Concurrent in-flight requests. |
| qyverixai_http_request_exceptions_total | Counter | method, endpoint, exception_type | Unhandled exceptions raised during request handling. |
| qyverixai_app_info | Gauge | version, ai_provider | Static identity; the value is always 1. |
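
For reviewers less familiar with the exposition format, here is a rough, stdlib-only sketch of how one counter family with these labels ends up as text on /metrics. It is not the prometheus_client implementation and not the code in this PR; record_request, render_counter, and escape_label_value are hypothetical helpers written for illustration.

```python
from collections import Counter


def escape_label_value(value: str) -> str:
    # The text format requires escaping backslash, double quote and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def record_request(samples: Counter, method: str, endpoint: str, status_code: int) -> None:
    # One time series per distinct label combination (labels kept sorted by name).
    key = (("endpoint", endpoint), ("method", method), ("status_code", str(status_code)))
    samples[key] += 1


def render_counter(name: str, help_text: str, samples: Counter) -> str:
    # HELP and TYPE comment lines precede the samples of each metric family.
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {float(value)}")
    return "\n".join(lines) + "\n"


samples: Counter = Counter()
record_request(samples, "GET", "/healthz/live", 200)
record_request(samples, "GET", "/healthz/ready", 200)
record_request(samples, "GET", "/healthz/ready", 200)
exposition = render_counter(
    "qyverixai_http_requests_total", "Total number of requests processed.", samples
)
print(exposition, end="")
```

Every distinct combination of label values becomes its own line, which is why the choice of endpoint label matters for cardinality.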

Design choices worth highlighting in review

  • Route template, not raw path, as the endpoint label: the middleware reads request.scope["route"].path after routing, so /share/abc and /share/def collapse into a single series labelled endpoint="/share/{token}". This avoids the classic Prometheus footgun where dynamic path segments inflate label cardinality. A dedicated test (test_metrics_uses_route_template_not_raw_path) guards this invariant.

  • /metrics is excluded from observation: scrapes would otherwise increment the qyverixai_http_requests_total counter on every interval. Tested via test_metrics_endpoint_excludes_itself.

  • METRICS_ENABLED and METRICS_AUTH_TOKEN are read at request time, not import time: operators can flip them without restarting, and tests don't have to reload modules to exercise them (which would re-register metrics and trip the "Duplicated timeseries in CollectorRegistry" error).

  • PROMETHEUS_MULTIPROC_DIR is honoured: when running uvicorn with --workers N and N > 1, set this env var to a writable directory and /metrics will aggregate across workers via prometheus_client.multiprocess.MultiProcessCollector.

  • Optional bearer auth on /metrics: set METRICS_AUTH_TOKEN to require Authorization: Bearer <token> on scrapes. This is useful when the endpoint is reachable from outside the cluster.

  • Existing /health and /ping are untouched: anything currently pointing at them keeps working.
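
A minimal sketch of the request-time gating described in the bullets above, under stated assumptions. metrics_access is a hypothetical helper, not the code in backend/app/routers/metrics.py. The 404 for the disabled state follows the test name test_metrics_endpoint_404_when_disabled; the 401 for a missing or wrong token, the set of values treated as "disabled", and the enabled-by-default behaviour are assumptions.

```python
import hmac
import os
from typing import Optional

# Assumption: which strings count as "disabled" is not stated in the PR.
_FALSE_VALUES = {"0", "false", "no", "off"}


def metrics_access(authorization_header: Optional[str]) -> int:
    """Return the HTTP status a /metrics request would receive."""
    # Environment is read on every call, so no module reload is needed
    # for a change to take effect.
    enabled = os.environ.get("METRICS_ENABLED", "true").strip().lower()
    if enabled in _FALSE_VALUES:
        return 404

    token = os.environ.get("METRICS_AUTH_TOKEN", "")
    if not token:
        return 200  # auth is optional: no token configured means open access

    expected = f"Bearer {token}".encode()
    supplied = (authorization_header or "").encode()
    # Constant-time comparison avoids leaking the token through timing.
    if hmac.compare_digest(supplied, expected):
        return 200
    return 401  # assumption: the PR does not state the rejection status
```

Because nothing is cached at import time, a test can set or clear the variables with monkeypatch.setenv and call the endpoint again without touching the metric registry.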
Files added

  • backend/app/observability.py - metric definitions + HTTP middleware.

  • backend/app/routers/health.py - /healthz/live, /healthz/ready.

  • backend/app/routers/metrics.py - /metrics endpoint with auth and disabled handling.

  • backend/tests/test_health_metrics.py - 8 tests covering happy paths, the degraded readiness path, label cardinality, scrape self-exclusion, bearer-auth gating, and the disabled state.

  • deploy/k8s/deployment.example.yaml - example Deployment + Service with livenessProbe, readinessProbe, startupProbe, and Prometheus scrape annotations.

  • deploy/prometheus/scrape-config.example.yaml - drop-in scrape job.

Files modified

  • backend/app/main.py - registers the metrics middleware (installed unconditionally; it self-disables per request when METRICS_ENABLED=false), includes the two new routers, and initialises the app_info gauge in the lifespan handler.

  • backend/app/schemas.py - adds LivenessResponse and ReadinessResponse.

  • backend/requirements.txt - adds prometheus-client>=0.20.0.

  • Dockerfile and backend/Dockerfile - add a HEALTHCHECK directive that hits /healthz/live.

  • .env.example - documents METRICS_ENABLED, METRICS_AUTH_TOKEN, and PROMETHEUS_MULTIPROC_DIR.

  • README.md - adds an Observability section with metric tables, a degraded-response example, and links to the deploy/ examples.

Example: degraded readiness response

{
  "status": "degraded",
  "checks": {
    "database": {
      "ok": false,
      "elapsed_ms": 2003.41,
      "error": "OperationalError: connection refused"
    }
  }
}
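
A response of this shape could be assembled roughly as follows. This is a sketch, not the code in backend/app/routers/health.py: run_check and readiness are hypothetical names, sqlite3 stands in for the SQLAlchemy engine so the example runs with the standard library alone, and the "ok" status string for the healthy case is an assumption (the PR only shows the degraded body).

```python
import sqlite3
import time
from typing import Callable, Dict, Tuple


def run_check(check: Callable[[], None]) -> dict:
    """Run one dependency check and time it; never raises."""
    started = time.perf_counter()
    error = None
    try:
        check()
    except Exception as exc:
        error = f"{type(exc).__name__}: {exc}"
    result = {
        "ok": error is None,
        "elapsed_ms": round((time.perf_counter() - started) * 1000, 2),
    }
    if error is not None:
        result["error"] = error
    return result


def readiness(checks: Dict[str, Callable[[], None]]) -> Tuple[int, dict]:
    """Return (HTTP status, JSON body): 200 if every check passes, else 503."""
    results = {name: run_check(fn) for name, fn in checks.items()}
    all_ok = all(r["ok"] for r in results.values())
    body = {"status": "ok" if all_ok else "degraded", "checks": results}
    return (200 if all_ok else 503), body


def sqlite_select_1() -> None:
    # Stand-in for running SELECT 1 against the real engine.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("SELECT 1").fetchone()
    finally:
        conn.close()


def failing_check() -> None:
    # Simulates an unreachable database.
    raise ConnectionError("connection refused")
```

Calling readiness({"database": sqlite_select_1}) yields a 200, while readiness({"database": failing_check}) yields a 503 with the per-check error string filled in. Catching the exception inside run_check keeps one failing dependency from hiding the results of the others.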

Related Issue

Fixes #219

Type of change

  • Bug fix
  • New feature / enhancement
  • Documentation update
  • Test addition
  • Refactor

Checklist

  • I have read CONTRIBUTING.md
  • My branch is up to date with main
  • I have run pytest -v and all tests pass
  • I have not introduced duplicate issues or features
  • My PR title follows the format: feat/fix/docs/test: short description
  • I have added tests for new features (Level 2 and 3 issues)
  • No hardcoded secrets or API keys in my code
  • This PR is linked to a GSSoC 2026 issue

Screenshots (if frontend change)

Not applicable as this is a backend-only change.

Test evidence

Full suite: 76 passed (68 pre-existing + 8 new). Ruff clean against the same selectors the CI workflow uses (ruff check backend/app --select E,F,W --ignore E501).

$ cd backend && pytest -v
...
tests/test_health_metrics.py::test_liveness_returns_ok PASSED            [ 77%]
tests/test_health_metrics.py::test_readiness_returns_ok_when_db_reachable PASSED [ 78%]
tests/test_health_metrics.py::test_readiness_returns_503_when_db_check_fails PASSED [ 80%]
tests/test_health_metrics.py::test_metrics_endpoint_returns_prometheus_format PASSED [ 81%]
tests/test_health_metrics.py::test_metrics_uses_route_template_not_raw_path PASSED [ 82%]
tests/test_health_metrics.py::test_metrics_endpoint_excludes_itself PASSED [ 84%]
tests/test_health_metrics.py::test_metrics_requires_token_when_configured PASSED [ 85%]
tests/test_health_metrics.py::test_metrics_endpoint_404_when_disabled PASSED [ 86%]
...
============================== 76 passed in 2.07s ==============================

Manual smoke test against a running instance

# Liveness - always cheap, no dependencies
$ curl -s http://localhost:8000/healthz/live
{"status":"ok"}

# Readiness - runs SELECT 1; flips to 503 + per-check breakdown when the DB is down
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/healthz/ready
200

# Prometheus scrape
$ curl -s http://localhost:8000/metrics | grep '^qyverixai_http_requests_total' | head
qyverixai_http_requests_total{endpoint="/healthz/live",method="GET",status_code="200"} 1.0
qyverixai_http_requests_total{endpoint="/healthz/ready",method="GET",status_code="200"} 2.0

Note the endpoint label values are route templates, not raw URLs.
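
To make the cardinality point concrete, here is a small stand-in for the label resolution. In the PR the template comes from request.scope["route"].path after Starlette has routed the request; the matcher below is only a stdlib approximation for illustration. endpoint_label and the "__unmatched__" fallback label are assumptions, not names from this PR.

```python
from typing import Iterable


def _is_param(segment: str) -> bool:
    # A template segment such as "{token}" matches any non-empty path segment.
    return segment.startswith("{") and segment.endswith("}")


def endpoint_label(
    raw_path: str, templates: Iterable[str], fallback: str = "__unmatched__"
) -> str:
    """Map a raw request path to its route template for use as a label value."""
    path_parts = raw_path.strip("/").split("/")
    for template in templates:
        tpl_parts = template.strip("/").split("/")
        if len(tpl_parts) != len(path_parts):
            continue
        if all(
            t == p or (_is_param(t) and p != "")
            for t, p in zip(tpl_parts, path_parts)
        ):
            return template
    # Unknown paths share one label so arbitrary URLs cannot create new series.
    return fallback


ROUTES = ["/healthz/live", "/healthz/ready", "/share/{token}"]
labels = {endpoint_label(p, ROUTES) for p in ("/share/abc", "/share/def")}
print(labels)
```

Both /share/abc and /share/def resolve to the same label, so the number of series is bounded by the number of routes rather than by the number of distinct URLs seen.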

@madsysharma madsysharma requested a review from imDarshanGK as a code owner May 23, 2026 12:26
@madsysharma
Author

Hi @imDarshanGK , please review this PR. Also, could you please add the "gssoc:approved" label? Thank you.

@github-actions github-actions Bot added type:backend Backend related tasks testing labels May 26, 2026
@madsysharma
Author

madsysharma commented May 26, 2026

Hi @imDarshanGK, please review this updated PR when you find the time. Also, please add the 'gssoc:approved' label so that this contribution is tracked in GSSoC. Thanks.

@imDarshanGK imDarshanGK added gssoc2026 Official GSSoC 2026 issue level:intermediate Intermediate tasks performance Performance improvement and removed testing labels May 26, 2026
@imDarshanGK
Owner

@madsysharma update the branch with the latest main changes and fix the CI failures

@madsysharma
Author

madsysharma commented May 26, 2026

Hi @imDarshanGK , please check the updated PR now (have addressed the merge issues and CI errors). And please add the 'gssoc:approved' label once you've checked that it's good to go. Thank you.

@imDarshanGK
Owner

@madsysharma update the branch with the latest main changes.

@madsysharma
Author

Hi @imDarshanGK , have resolved the merge conflicts. Please review this updated PR, and add the 'gssoc:approved' label once it's good to go. Thank you.

@imDarshanGK
Owner

@madsysharma update the branch with the latest main changes; CI is failing

@madsysharma madsysharma force-pushed the feat/health-checks branch from fc31889 to bb520e0 Compare May 28, 2026 19:15
@madsysharma
Author

Hi @imDarshanGK , could you please check now? Thanks.

@madsysharma
Author

Hi @imDarshanGK , could you please review this updated PR? Thanks.

@imDarshanGK
Owner

@madsysharma Sync the branch with the latest main

@madsysharma
Author

Hi @imDarshanGK , please check now. Thank you.

@imDarshanGK
Owner

@madsysharma Resolve conflicts

@madsysharma
Author

Hi @imDarshanGK , have resolved the merge conflicts. Please check. Thank you.

Author

@madsysharma madsysharma left a comment

Hi @imDarshanGK , please review this updated PR. Thank you.

Author

@madsysharma madsysharma left a comment

Updated the PR

@madsysharma madsysharma force-pushed the feat/health-checks branch from 2d76a09 to 4af6089 Compare May 30, 2026 20:37
Author

@madsysharma madsysharma left a comment

Updated files in PR

Labels

gssoc2026 Official GSSoC 2026 issue level:intermediate Intermediate tasks performance Performance improvement type:backend Backend related tasks

Development

Successfully merging this pull request may close these issues.

Add health checks and Prometheus metrics endpoint

2 participants