feat(observability): production-grade metrics, alerting, synthetic canary & SLOs (#650) #665
Open
CelestinaBeing wants to merge 2 commits into
FinesseStudioLab#650)
Implements the full observability & runtime-reliability epic:
**RPC pool safety**
- Add acquire/release concurrency gate to rpcPool.js (maxConcurrent=10, acquireTimeoutMs=5000). Waiters over the timeout receive a typed PoolSaturatedError (code POOL_SATURATED) so the error handler can return 503 instead of letting callers hang.
- getStatus() exposes in_use/idle/waiting/healthy/unhealthy for Prometheus.
**Request deadlines**
- New backend/src/middleware/timeout.js: requestTimeout(ms) attaches an AbortSignal to req.signal, propagates client-disconnect aborts, and returns 504 REQUEST_TIMEOUT when the wall-clock deadline fires.
- Wired globally in index.js (DEFAULT_REQUEST_TIMEOUT_MS=30000, overridable via env REQUEST_TIMEOUT_MS).
**Prometheus metrics**
- /metrics now emits a full histogram for trivela_http_request_duration_ms (buckets: 50/100/200/500/1000/2000/5000/10000/30000/+Inf), enabling histogram_quantile p50/p95/p99 in Grafana and PromQL.
- RPC pool gauges: trivela_rpc_pool_in_use, _idle, _waiting, _healthy, _unhealthy, consumed by the RpcPoolSaturated and AllRpcEndpointsUnhealthy alerts.
**Graceful shutdown**
- SIGTERM/SIGINT handler in startServer(): stops accepting new connections, forces exit after SHUTDOWN_GRACE_MS (default 15 s), flushes OpenTelemetry traces via shutdownTracing(), then exits cleanly.
**Prometheus alerting rules** (monitoring/alerting/alerting_rules.yml)
- trivela_backend: HighBackendErrorRate (>5% for 5 m, critical), HighP95Latency (p95 > 1 s), BackendProcessRestart, BackendDown, AuthFailureSpike, AuthLockoutTriggered.
- trivela_rpc: AllRpcEndpointsUnhealthy, RpcPoolSaturated, RPCHealthCheckDown.
- trivela_indexer: IndexerLag (cursor stalled > 10 m).
- trivela_contracts: ContractPaused, CampaignDBWriteErrors.
- trivela_dlq: DLQGrowth.
- trivela_operator: OperatorLowBalance (< 5 XLM).
- trivela_canary: CanaryJourneyFailed, CanarySlowJourney (> 30 s).
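For reviewers less familiar with Prometheus histograms, here is a rough sketch of how the cumulative buckets behind trivela_http_request_duration_ms (which the HighP95Latency rule depends on) could be rendered. It is not the code in index.js: the helper names `createHistogram`, `observe` and `render` are hypothetical, and only the metric name and bucket bounds are taken from the notes above.

```javascript
// Bucket upper bounds in milliseconds, as listed in the commit message.
const BUCKETS_MS = [50, 100, 200, 500, 1000, 2000, 5000, 10000, 30000];

// Hypothetical in-memory state; the real /metrics handler may store this differently.
function createHistogram() {
  return { counts: new Array(BUCKETS_MS.length).fill(0), sum: 0, count: 0 };
}

function observe(hist, durationMs) {
  // Prometheus buckets are cumulative: an observation counts toward every
  // bucket whose upper bound is >= the observed value.
  BUCKETS_MS.forEach((le, i) => {
    if (durationMs <= le) hist.counts[i] += 1;
  });
  hist.sum += durationMs;
  hist.count += 1;
}

function render(hist, name = 'trivela_http_request_duration_ms') {
  const lines = [`# TYPE ${name} histogram`];
  BUCKETS_MS.forEach((le, i) => {
    lines.push(`${name}_bucket{le="${le}"} ${hist.counts[i]}`);
  });
  // The +Inf bucket always equals the total observation count.
  lines.push(`${name}_bucket{le="+Inf"} ${hist.count}`);
  lines.push(`${name}_sum ${hist.sum}`);
  lines.push(`${name}_count ${hist.count}`);
  return lines.join('\n') + '\n';
}
```

With series of this shape, p95 in milliseconds can be queried with `histogram_quantile(0.95, sum by (le) (rate(trivela_http_request_duration_ms_bucket[5m])))`, so a 1 s threshold corresponds to a value of 1000.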
**promtool unit tests** (monitoring/alerting/alerting_rules_test.yml)
- 10 test cases covering all primary alerts (both fire and no-fire paths).
**Alertmanager** (monitoring/alertmanager.yml)
- Routes critical→#trivela-critical + PagerDuty, contracts→#trivela-contracts, everything else→#trivela-alerts. Inhibition rules suppress child alerts when BackendDown or AllRpcEndpointsUnhealthy fires.
**Grafana dashboards-as-code** (monitoring/dashboards/)
- trivela-api.json: request rate, error rate, p50/p95/p99 latency, auth events, backend up/uptime.
- trivela-rpc-pools.json: pool saturation gauges (in-use, waiting, idle), endpoint health time series.
- Provisioning configs: grafana/provisioning/datasources/prometheus.yml and grafana/provisioning/dashboards/trivela.yml for zero-config Grafana startup.
**Synthetic canary** (scripts/canary.mjs)
- Five-step journey: health check → create canary campaign → credit claimant → verify stats → cleanup. Emits trivela_canary_success, _duration_seconds, _last_run_timestamp in Prometheus text format.
- Designed for cron/every-5-min scheduling; CANARY_METRICS_FILE writes to a file for node-exporter textfile collector instead of stdout.
**SLO definitions** (docs/SLO.md)
- Availability (99.5% API, 99.9% reachability, 99.0% RPC), latency (p95 ≤ 1 s), indexer freshness (cursor advance every 10 m), pool saturation, canary, and operator balance SLOs. Error budget policy and measurement pointers.
**CI** (.github/workflows/observability-ci.yml)
- Runs promtool check rules + promtool test rules on every monitoring/ change.
- Syntax-checks the canary script and dry-runs it with a 2 s timeout.
- Runs backend unit tests scoped to the backend package via turbo.
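As an illustration of the canary output described above, the following sketch renders the three canary metrics in Prometheus text format and writes them either to stdout or to the file named by CANARY_METRICS_FILE. The function names, the choice of gauges, and the use of Unix seconds for the timestamp are assumptions; scripts/canary.mjs may differ.

```javascript
// Hypothetical helper: builds the text-format payload for one canary run.
function renderCanaryMetrics({ success, durationSeconds, finishedAtMs }) {
  const lines = [
    '# TYPE trivela_canary_success gauge',
    `trivela_canary_success ${success ? 1 : 0}`,
    '# TYPE trivela_canary_duration_seconds gauge',
    `trivela_canary_duration_seconds ${durationSeconds}`,
    '# TYPE trivela_canary_last_run_timestamp gauge',
    // Assumption: timestamp is exported as Unix seconds.
    `trivela_canary_last_run_timestamp ${Math.floor(finishedAtMs / 1000)}`,
  ];
  return lines.join('\n') + '\n';
}

// Hypothetical helper: stdout by default, file when CANARY_METRICS_FILE is set.
function writeCanaryMetrics(text, file = process.env.CANARY_METRICS_FILE) {
  if (!file) {
    process.stdout.write(text);
    return;
  }
  const fs = require('node:fs');
  const tmp = `${file}.tmp`;
  // Write to a temp file and rename, so a scraper never reads a half-written file.
  fs.writeFileSync(tmp, text);
  fs.renameSync(tmp, file);
}
```

The write-then-rename step is a common precaution with the node-exporter textfile collector, which reads the directory on its own schedule.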
@CelestinaBeing is attempting to deploy a commit to the joelpeace48-cell's projects Team on Vercel. A member of the Team first needs to authorize it.
…ilure
**Prettier formatting (6 files re-formatted)**
The format-check workflow runs prettier over backend/src/**/*.{js,json} and
**/*.{md,yaml,yml}. The following files written in the observability commit
did not match the repo's prettier style and needed reformatting:
- backend/src/index.js
- .github/workflows/observability-ci.yml
- docs/SLO.md
- monitoring/alerting/alerting_rules.yml
- monitoring/alerting/alerting_rules_test.yml
- monitoring/alertmanager.yml
All files now pass `prettier --check`.
**PromQL parse error in OperatorLowBalance rule**
The expression used `50_000_000` (JavaScript numeric separator syntax).
The promtool version in use does not accept underscores in numeric literals, causing:
bad number or duration syntax: "50" (at line 216)
Fix: change to `50000000` (plain integer).
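If a threshold like this is ever generated from JavaScript, the separator can stay in the JS source as long as the value is converted to a plain string before it reaches the rule file. The sketch below assumes the balance metric is reported in stroops (1 XLM = 10,000,000 stroops, so 5 XLM = 50,000,000); `operator_balance_stroops` is a hypothetical metric name, not the one used in alerting_rules.yml.

```javascript
// Numeric separators are valid in JavaScript source code.
const STROOPS_PER_XLM = 10_000_000;
const thresholdStroops = 5 * STROOPS_PER_XLM;

// String() never emits separators, so the rendered literal is a plain integer.
// `operator_balance_stroops` is a placeholder metric name.
const expr = `operator_balance_stroops < ${String(thresholdStroops)}`;
```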
**Canary JSDoc prematurely closed by `*/` in cron example**
Line 17 of scripts/canary.mjs contained:
* # */5 * * * * node scripts/canary.mjs ...
The `*/` in the cron expression closed the surrounding `/** ... */` block
comment, leaving the rest of the line as bare JS code. Node saw:
SyntaxError: Unexpected token '*'
Fix: rewrite the scheduling note to avoid `*/` inside the block comment.
`node --check scripts/canary.mjs` now passes.
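One way to keep the schedule documented without reintroducing the problem is to hold the cron expression in a string constant and describe it in words inside the comment. This is only a sketch of that idea; `CRON_SCHEDULE` and `isBlockCommentSafe` are hypothetical names and not part of scripts/canary.mjs.

```javascript
/**
 * Synthetic canary entry point (sketch).
 *
 * Scheduling: run every five minutes from cron. The expression itself lives
 * in CRON_SCHEDULE below, because a star immediately followed by a slash
 * would end this comment early.
 */
const CRON_SCHEDULE = '*/5 * * * *'; // safe here: a string literal, not a block comment

// Hypothetical helper: true if the text can be pasted into a block comment unchanged.
function isBlockCommentSafe(text) {
  return !text.includes('*/');
}
```

A check like `isBlockCommentSafe` could run in a lint step, although `node --check` already catches the failure once it occurs.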
Summary
Closes #650. This PR delivers the full observability & runtime-reliability epic across seven deliverable areas: RPC pool saturation safety, per-request deadlines, Prometheus metrics + histograms, graceful shutdown, Prometheus alerting rules, Grafana dashboards-as-code, and a synthetic canary with SLO definitions.
What changed
1. RPC pool saturation safety (`backend/src/rpcPool.js`)
- Adds an `acquire()`/`release()` concurrency gate with configurable `maxConcurrent` (default 10) and `acquireTimeoutMs` (default 5 000 ms).
- Waiters over the timeout receive a typed `PoolSaturatedError` (`code: 'POOL_SATURATED'`) rather than hanging indefinitely.
- `getStatus()` method exposes `in_use`, `idle`, `waiting`, `healthy`, `unhealthy` for the `/metrics` endpoint.
- `errorHandler.js` maps `POOL_SATURATED` → HTTP 503 with `{ error: 'Service temporarily unavailable', code: 'POOL_SATURATED' }`.
2. Per-request deadlines (`backend/src/middleware/timeout.js`)
- `requestTimeout(ms)` middleware attaches an `AbortController` signal to `req.signal` so downstream handlers (RPC calls, DB queries) can honour cancellation.
- Returns `504 REQUEST_TIMEOUT` before Express's default error boundary fires.
- Wired globally in `index.js` at `DEFAULT_REQUEST_TIMEOUT_MS = 30_000` (overridable via the `REQUEST_TIMEOUT_MS` env var).
3. Prometheus metrics + latency histogram (`backend/src/index.js`)
`/metrics` now emits:
- `trivela_http_request_duration_ms_bucket{le="N"}` (enables `histogram_quantile`)
- `trivela_http_request_duration_ms_sum`
- `trivela_http_request_duration_ms_count`
- `trivela_rpc_pool_in_use`
- `trivela_rpc_pool_idle`
- `trivela_rpc_pool_waiting`
- `trivela_rpc_pool_healthy`
- `trivela_rpc_pool_unhealthy`
Latency is recorded in a `res.on('finish')` hook on every request.
4. Graceful shutdown (`backend/src/index.js`)
- `SIGTERM`/`SIGINT` handlers call `gracefulShutdown(signal)`.
- Stop accepting new connections → force exit after `SHUTDOWN_GRACE_MS` (default 15 s) → flush OpenTelemetry via `shutdownTracing()` → `process.exit(0)`.
5. Prometheus alerting rules (`monitoring/alerting/alerting_rules.yml`)
16 alert rules across 7 groups covering all 7 epic requirements:
`HighBackendErrorRate`, `HighP95Latency`, `BackendDown`, `BackendProcessRestart`, `AuthFailureSpike`, `AuthLockoutTriggered`, `AllRpcEndpointsUnhealthy`, `RpcPoolSaturated`, `RPCHealthCheckDown`, `IndexerLag`, `ContractPaused`, `CampaignDBWriteErrors`, `DLQGrowth`, `OperatorLowBalance`, `CanaryJourneyFailed`, `CanarySlowJourney`.
promtool unit tests (`monitoring/alerting/alerting_rules_test.yml`) cover 10 scenarios (both fire and no-fire paths).
6. Grafana dashboards-as-code (`monitoring/dashboards/`)
- `trivela-api.json`: request rate, error rate, p50/p95/p99 latency, auth events, backend up/uptime.
- `trivela-rpc-pools.json`: pool saturation gauges with colour thresholds, endpoint health series.
- Provisioning configs in `monitoring/grafana/provisioning/` wire datasource + dashboards on container startup (zero manual import).
- `monitoring/alertmanager.yml`: routes critical to `#trivela-critical` + PagerDuty, inhibits child alerts on parent failures.
- `compose.monitoring.yml` (root): Docker Compose overlay for Prometheus, Grafana, Alertmanager, node-exporter.
7. Synthetic canary (`scripts/canary.mjs`)
Five-step journey: health check → create canary campaign → credit claimant → verify stats → cleanup. Emits `trivela_canary_success`, `trivela_canary_duration_seconds`, `trivela_canary_last_run_timestamp` in Prometheus text format. Designed for `*/5 * * * *` cron or GitHub Actions schedule.
8. SLO definitions (`docs/SLO.md`)
Availability (≥ 99.5% API, ≥ 99.9% reachability, ≥ 99.0% RPC), latency (p95 ≤ 1 000 ms), indexer freshness (cursor advance every 10 min), canary (success every 5 min, ≤ 30 s), operator balance (≥ 5 XLM). Error budget policy and measurement pointers.
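To give a feel for what the availability targets imply, the sketch below converts each target into an error budget expressed as minutes of full unavailability. The 30-day window and the function name `errorBudgetMinutes` are assumptions for illustration; docs/SLO.md may define the window and the budget policy differently.

```javascript
// Hypothetical helper: minutes of total downtime allowed by an availability target.
function errorBudgetMinutes(target, windowDays = 30) {
  const windowMinutes = windowDays * 24 * 60;
  // Round to two decimals to hide binary floating-point noise.
  return Math.round((1 - target) * windowMinutes * 100) / 100;
}

const budgets = {
  api: errorBudgetMinutes(0.995), // 216 minutes per 30 days
  reachability: errorBudgetMinutes(0.999), // 43.2 minutes per 30 days
  rpc: errorBudgetMinutes(0.99), // 432 minutes per 30 days
};
```

Under these assumptions the API target leaves roughly three and a half hours of budget per month, while the reachability target leaves under 45 minutes.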
9. CI (`.github/workflows/observability-ci.yml`)
`promtool check rules` + `promtool test rules` + canary syntax/dry-run + backend unit tests via turbo; runs on every `monitoring/**` or `scripts/canary.mjs` change.
Test plan
- `promtool test rules monitoring/alerting/alerting_rules_test.yml` passes (10/10 cases)
- `node --check scripts/canary.mjs` passes
- `docker compose -f compose.yaml -f compose.monitoring.yml up` brings up Prometheus, Grafana, Alertmanager
- Grafana at `:3001` shows Trivela API and RPC dashboards auto-provisioned
- `/metrics` endpoint includes histogram buckets and pool gauges
- A request that exceeds its deadline returns 504 with `code: REQUEST_TIMEOUT`
- A pool wait longer than `acquireTimeoutMs` returns 503 with `code: POOL_SATURATED`
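For the POOL_SATURATED path exercised by the test plan, the following is a minimal sketch of an acquire/release gate with a wait timeout. It is an illustration under assumptions, not the contents of rpcPool.js: `ConcurrencyGate` and `toHttpError` are hypothetical names, the 500 fallback is assumed, and the `healthy`/`unhealthy` endpoint counts are left out.

```javascript
class PoolSaturatedError extends Error {
  constructor(message = 'RPC pool saturated') {
    super(message);
    this.name = 'PoolSaturatedError';
    this.code = 'POOL_SATURATED';
  }
}

class ConcurrencyGate {
  constructor({ maxConcurrent = 10, acquireTimeoutMs = 5000 } = {}) {
    this.maxConcurrent = maxConcurrent;
    this.acquireTimeoutMs = acquireTimeoutMs;
    this.inUse = 0;
    this.waiters = []; // FIFO queue of pending acquire() calls
  }

  acquire() {
    if (this.inUse < this.maxConcurrent) {
      this.inUse += 1;
      return Promise.resolve();
    }
    return new Promise((resolve, reject) => {
      const waiter = { resolve, timer: null };
      waiter.timer = setTimeout(() => {
        // Timed out: leave the queue and fail fast instead of hanging.
        const i = this.waiters.indexOf(waiter);
        if (i !== -1) this.waiters.splice(i, 1);
        reject(new PoolSaturatedError());
      }, this.acquireTimeoutMs);
      this.waiters.push(waiter);
    });
  }

  release() {
    const next = this.waiters.shift();
    if (next) {
      // Hand the slot directly to the oldest waiter; inUse stays the same.
      clearTimeout(next.timer);
      next.resolve();
    } else if (this.inUse > 0) {
      this.inUse -= 1;
    }
  }

  getStatus() {
    return {
      in_use: this.inUse,
      idle: this.maxConcurrent - this.inUse,
      waiting: this.waiters.length,
    };
  }
}

// Hypothetical mapping in the spirit of errorHandler.js; the 500 branch is assumed.
function toHttpError(err) {
  if (err && err.code === 'POOL_SATURATED') {
    return {
      status: 503,
      body: { error: 'Service temporarily unavailable', code: 'POOL_SATURATED' },
    };
  }
  return { status: 500, body: { error: 'Internal server error' } };
}
```

Callers would typically pair `await gate.acquire()` with `gate.release()` in a `finally` block, so that a failed RPC call still frees its slot.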