Feat/health checks#275
Conversation
|
Hi @imDarshanGK , please review this PR. Also, could you please add the "gssoc:approved" label? Thank you. |
1 similar comment
|
Hi @imDarshanGK , please review this PR. Also, could you please add the "gssoc:approved" label? Thank you. |
|
Hi @imDarshanGK , please review this updated PR when you find the time to. Also, please add the 'gssoc:approved' label for this contribution to be tracked in GSSoC. Thanks. |
|
@madsysharma update the branch with the latest main changes and CI failures |
7cd5d0b to
2b9f521
Compare
|
Hi @imDarshanGK , please check the updated PR now (have addressed the merge issues and CI errors). And please add the 'gssoc:approved' label once you've checked that it's good to go. Thank you. |
|
@madsysharma update the branch with the latest main changes. |
|
Hi @imDarshanGK , have resolved the merge conflicts. Please review this updated PR, and add the 'gssoc:approved' label once it's good to go. Thank you. |
|
@madsysharma update the branch with the latest main changes, CI failing |
fc31889 to
bb520e0
Compare
|
Hi @imDarshanGK , could you please check now? Thanks. |
|
Hi @imDarshanGK , could you please review this updated PR? Thanks. |
|
@madsysharma Sync the branch with the latest main |
|
Hi @imDarshanGK , please check now. Thank you. |
|
@madsysharma Resolve conflicts |
|
Hi @imDarshanGK , have resolved the merge conflicts. Please check. Thank you. |
madsysharma
left a comment
There was a problem hiding this comment.
Hi @imDarshanGK , please review this updated PR. Thank you.
Removed duplicate and unnecessary routes from the list.
2d76a09 to
4af6089
Compare
Description
Adds operational endpoints requested in #219: a Prometheus metrics scrape target and Kubernetes-style liveness/readiness probes.
New endpoints
GET /healthz/live200 {"status":"ok"}while the process can answer HTTP. Does not check external dependencies - Kubernetes restarts the container on failure, so this must never depend on recoverable backends.GET /healthz/readySELECT 1against the SQLAlchemy engine. Returns200when all checks pass,503with a per-check breakdown when any fail. Kubernetes pulls the pod out of the Service load balancer on failure but does not restart it.GET /metricsapp_infogauge.Metric families exposed
qyverixai_http_requests_totalmethod,endpoint,status_codeqyverixai_http_request_duration_secondsmethod,endpointqyverixai_http_requests_in_progressmethod,endpointqyverixai_http_request_exceptions_totalmethod,endpoint,exception_typeqyverixai_app_infoversion,ai_provider1.Design choices worth highlighting in review
Route template, not raw path, as the
endpointlabel.: the middleware readsrequest.scope["route"].pathafter routing, so/share/abcand/share/defcollapse into a single series labelledendpoint="/share/{token}". This avoids the classic Prometheus footgun where dynamic path segments balloon label cardinality. There's a dedicated test (test_metrics_uses_route_template_not_raw_path) guarding this invariant./metricsis excluded from observation.: scrapes would otherwise feed back into the request_total counter on every interval. Tested viatest_metrics_endpoint_excludes_itself.METRICS_ENABLEDandMETRICS_AUTH_TOKENare read at request time, not import time.: operators can flip them without restarting; tests don't have to reload modules to exercise them (which would re-register metrics and tripDuplicated timeseries in CollectorRegistry).PROMETHEUS_MULTIPROC_DIRis honoured.: when runninguvicorn --workers N > 1, set this env var to a writable directory and/metricswill aggregate across workers viaprometheus_client.multiprocess.MultiProcessCollector.Optional bearer auth on
/metrics.: setMETRICS_AUTH_TOKENto requireAuthorization: Bearer <token>on scrapes - this is useful when the endpoint is reachable from outside the cluster.Existing
/healthand/pingare untouched.: anything currently pointing at them keeps functioning.Files added
backend/app/observability.py- metric definitions + HTTP middleware.backend/app/routers/health.py-/healthz/live,/healthz/ready.backend/app/routers/metrics.py-/metricsendpoint with auth and disabled handling.backend/tests/test_health_metrics.py- 8 tests covering happy paths, the degraded readiness path, label cardinality, scrape self-exclusion, bearer-auth gating, and the disabled state.deploy/k8s/deployment.example.yaml- example Deployment + Service withlivenessProbe,readinessProbe,startupProbe, and Prometheus scrape annotations.deploy/prometheus/scrape-config.example.yaml- drop-in scrape job.Files modified
backend/app/main.py- now registers the metrics middleware (installed unconditionally; self-disables per request whenMETRICS_ENABLED=false), includes the two new routers, and initialises theapp_infogauge in the lifespan handler.backend/app/schemas.py- now addsLivenessResponseandReadinessResponse.backend/requirements.txt- now addsprometheus-client>=0.20.0.Dockerfileandbackend/Dockerfile- now adds aHEALTHCHECKdirective that hits/healthz/live..env.example- now documentsMETRICS_ENABLED,METRICS_AUTH_TOKEN,PROMETHEUS_MULTIPROC_DIR.README.md- now documents the new Observability section with metric tables, degraded-response example, and links to thedeploy/examples.Example: degraded readiness response
{ "status": "degraded", "checks": { "database": { "ok": false, "elapsed_ms": 2003.41, "error": "OperationalError: connection refused" } } }Related Issue
Fixes #219
Type of change
Checklist
mainpytest -vand all tests passfeat/fix/docs/test: short descriptionScreenshots (if frontend change)
Not applicable as this is a backend-only change.
Test evidence
Full suite: 76 passed (68 pre-existing + 8 new). Ruff clean against the same selectors the CI workflow uses (
ruff check backend/app --select E,F,W --ignore E501).Manual smoke test against a running instance
Note the
endpointlabel values are route templates, not raw URLs.