feat(observability): production-grade metrics, alerting, synthetic canary & SLOs (#650) by CelestinaBeing · Pull Request #665 · FinesseStudioLab/Trivela

CelestinaBeing · 2026-06-20T12:25:44Z

Summary

Closes #650. This PR delivers the full observability & runtime-reliability epic across seven deliverable areas: RPC pool saturation safety, per-request deadlines, Prometheus metrics + histograms, graceful shutdown, Prometheus alerting rules, Grafana dashboards-as-code, and a synthetic canary with SLO definitions.

What changed

1. RPC pool saturation safety (`backend/src/rpcPool.js`)

Added acquire() / release() concurrency gate with configurable maxConcurrent (default 10) and acquireTimeoutMs (default 5 000 ms).
Callers that wait past the timeout receive a typed PoolSaturatedError (code: 'POOL_SATURATED') rather than hanging indefinitely.
New getStatus() method exposes in_use, idle, waiting, healthy, unhealthy for the /metrics endpoint.
errorHandler.js maps POOL_SATURATED → HTTP 503 with { error: 'Service temporarily unavailable', code: 'POOL_SATURATED' }.
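
For reviewers, a minimal sketch of how an acquire() / release() gate with a wait timeout can behave. It is an illustration, not the code in `rpcPool.js`: the class name `ConcurrencyGate`, the FIFO waiter queue, and the direct slot hand-off on release are assumptions. Only `PoolSaturatedError`, its `POOL_SATURATED` code, the two defaults, and the `in_use` / `idle` / `waiting` fields come from this PR.

```javascript
// Hypothetical sketch of a concurrency gate; the real rpcPool.js may differ.
class PoolSaturatedError extends Error {
  constructor(message = 'RPC pool saturated') {
    super(message);
    this.name = 'PoolSaturatedError';
    this.code = 'POOL_SATURATED';
  }
}

class ConcurrencyGate {
  constructor({ maxConcurrent = 10, acquireTimeoutMs = 5000 } = {}) {
    this.maxConcurrent = maxConcurrent;
    this.acquireTimeoutMs = acquireTimeoutMs;
    this.inUse = 0;
    this.waiters = []; // FIFO queue of pending acquire() calls (assumed ordering)
  }

  acquire() {
    if (this.inUse < this.maxConcurrent) {
      this.inUse += 1;
      return Promise.resolve();
    }
    return new Promise((resolve, reject) => {
      const waiter = { resolve, timer: null };
      waiter.timer = setTimeout(() => {
        // Drop the waiter so a later release() cannot hand it a slot.
        const i = this.waiters.indexOf(waiter);
        if (i !== -1) this.waiters.splice(i, 1);
        reject(new PoolSaturatedError());
      }, this.acquireTimeoutMs);
      this.waiters.push(waiter);
    });
  }

  release() {
    const next = this.waiters.shift();
    if (next) {
      // Hand the slot straight to the oldest waiter; inUse stays the same.
      clearTimeout(next.timer);
      next.resolve();
    } else if (this.inUse > 0) {
      this.inUse -= 1;
    }
  }

  getStatus() {
    return {
      in_use: this.inUse,
      idle: this.maxConcurrent - this.inUse,
      waiting: this.waiters.length,
    };
  }
}
```

A caller would typically pair `await gate.acquire()` with `gate.release()` in a `finally` block so that a failed RPC call still frees its slot.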

2. Per-request deadlines (`backend/src/middleware/timeout.js`)

requestTimeout(ms) middleware attaches an AbortController signal to req.signal so downstream handlers (RPC calls, DB queries) can honour cancellation.
Client disconnects cancel the signal immediately (no wasted work).
Wall-clock breach returns 504 REQUEST_TIMEOUT before Express's default error boundary fires.
Wired globally in index.js at DEFAULT_REQUEST_TIMEOUT_MS = 30_000 (overrideable via REQUEST_TIMEOUT_MS env var).
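
The deadline behaviour described above can be sketched roughly as follows. This is an assumption-laden illustration, not the contents of `timeout.js`: the 504 response body text and the use of the response `close` event to detect disconnects are guesses, while `requestTimeout(ms)`, `req.signal`, and the `REQUEST_TIMEOUT` code come from this PR.

```javascript
// Hypothetical sketch of an Express-style deadline middleware.
function requestTimeout(ms) {
  return function timeoutMiddleware(req, res, next) {
    const controller = new AbortController();
    req.signal = controller.signal; // downstream handlers can observe this

    const timer = setTimeout(() => {
      controller.abort(); // cancel in-flight downstream work
      if (!res.headersSent) {
        // Body text is an assumption; only the status and code are from the PR.
        res.status(504).json({ error: 'Request timed out', code: 'REQUEST_TIMEOUT' });
      }
    }, ms);

    // 'close' fires after a completed response and also on a client disconnect.
    res.on('close', () => {
      clearTimeout(timer);
      if (!res.writableFinished) controller.abort(); // client went away early
    });

    next();
  };
}
```

A downstream handler could then pass `req.signal` to any API that accepts an `AbortSignal`, for example `fetch(url, { signal: req.signal })`.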

3. Prometheus metrics + latency histogram (`backend/src/index.js`)

/metrics now emits:

| Metric | Type | Description |
| --- | --- | --- |
| `trivela_http_request_duration_ms_bucket{le="N"}` | histogram | Request duration across 9 finite buckets (50 → 30 000 ms) plus `+Inf`, enabling `histogram_quantile` |
| `trivela_http_request_duration_ms_sum` | counter | |
| `trivela_http_request_duration_ms_count` | counter | |
| `trivela_rpc_pool_in_use` | gauge | Active pool slots |
| `trivela_rpc_pool_idle` | gauge | Available pool slots |
| `trivela_rpc_pool_waiting` | gauge | Callers blocked on acquire |
| `trivela_rpc_pool_healthy` | gauge | Reachable RPC endpoints |
| `trivela_rpc_pool_unhealthy` | gauge | Quarantined RPC endpoints |

Latency is recorded in a res.on('finish') hook on every request.
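
As an illustration of how cumulative histogram buckets end up in the `/metrics` output, here is a small dependency-free sketch. The helper names (`createHistogram`, `observe`, `renderHistogram`, `latencyMiddleware`) are hypothetical and the actual code in `index.js` may be organised differently; the bucket bounds and the metric name are the ones listed in this PR.

```javascript
// Bucket upper bounds in milliseconds, as listed in this PR.
const BUCKETS_MS = [50, 100, 200, 500, 1000, 2000, 5000, 10000, 30000];

function createHistogram(bounds = BUCKETS_MS) {
  return { bounds, counts: new Array(bounds.length).fill(0), sum: 0, count: 0 };
}

function observe(hist, valueMs) {
  // Prometheus buckets are cumulative: a value counts toward every
  // bucket whose upper bound (le) is greater than or equal to it.
  hist.bounds.forEach((le, i) => {
    if (valueMs <= le) hist.counts[i] += 1;
  });
  hist.sum += valueMs;
  hist.count += 1;
}

function renderHistogram(name, hist) {
  const lines = [`# TYPE ${name} histogram`];
  hist.bounds.forEach((le, i) => {
    lines.push(`${name}_bucket{le="${le}"} ${hist.counts[i]}`);
  });
  lines.push(`${name}_bucket{le="+Inf"} ${hist.count}`); // +Inf always equals _count
  lines.push(`${name}_sum ${hist.sum}`);
  lines.push(`${name}_count ${hist.count}`);
  return lines.join('\n') + '\n';
}

// Hypothetical middleware: time each request and record it on 'finish'.
function latencyMiddleware(hist) {
  return (req, res, next) => {
    const start = process.hrtime.bigint();
    res.on('finish', () => {
      observe(hist, Number(process.hrtime.bigint() - start) / 1e6); // ns to ms
    });
    next();
  };
}
```

With buckets exposed this way, a p95 panel or alert can be written in PromQL as `histogram_quantile(0.95, sum by (le) (rate(trivela_http_request_duration_ms_bucket[5m])))`, which yields a value in milliseconds.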

4. Graceful shutdown (`backend/src/index.js`)

SIGTERM / SIGINT handlers call gracefulShutdown(signal).
Sequence: stop accepting new connections → force exit if draining exceeds SHUTDOWN_GRACE_MS (default 15 s) → flush OpenTelemetry via shutdownTracing() → process.exit(0).
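
A rough sketch of that shutdown sequence, written as a factory so it can be exercised without a real server. Everything here is illustrative: `createGracefulShutdown`, the injected `exit` and `log` parameters, and the non-zero exit code on the forced path are assumptions, and `shutdownTracing` is assumed to return a promise.

```javascript
// Hypothetical sketch; `server` is anything with an http.Server-style close(cb).
function createGracefulShutdown({
  server,
  shutdownTracing,
  graceMs = 15000,
  exit = process.exit,
  log = () => {},
}) {
  let shuttingDown = false;

  return async function gracefulShutdown(signal) {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;
    log(`${signal} received, shutting down`);

    // Safety net: force exit if draining takes longer than graceMs.
    // Exit code 1 on this path is an assumption.
    const forceTimer = setTimeout(() => exit(1), graceMs);
    if (typeof forceTimer.unref === 'function') forceTimer.unref();

    // Stop accepting new connections; the callback runs once the server has closed.
    await new Promise((resolve) => server.close(resolve));
    await shutdownTracing(); // flush buffered telemetry before exiting
    clearTimeout(forceTimer);
    exit(0);
  };
}
```

Wiring it up would look something like `process.on('SIGTERM', () => gracefulShutdown('SIGTERM'))`, with the same call registered for `SIGINT`.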

5. Prometheus alerting rules (`monitoring/alerting/alerting_rules.yml`)

16 alert rules across 7 groups covering all 7 epic requirements:
HighBackendErrorRate, HighP95Latency, BackendDown, BackendProcessRestart, AuthFailureSpike, AuthLockoutTriggered, AllRpcEndpointsUnhealthy, RpcPoolSaturated, RPCHealthCheckDown, IndexerLag, ContractPaused, CampaignDBWriteErrors, DLQGrowth, OperatorLowBalance, CanaryJourneyFailed, CanarySlowJourney.

promtool unit tests (monitoring/alerting/alerting_rules_test.yml) cover 10 scenarios (both fire and no-fire paths).

6. Grafana dashboards-as-code (`monitoring/dashboards/`)

trivela-api.json — request rate, error rate, p50/p95/p99 latency, auth events, backend up/uptime.
trivela-rpc-pools.json — pool saturation gauges with colour thresholds, endpoint health series.
Grafana provisioning configs in monitoring/grafana/provisioning/ wire datasource + dashboards on container startup (zero manual import).
monitoring/alertmanager.yml — routes critical to #trivela-critical + PagerDuty, inhibits child alerts on parent failures.
compose.monitoring.yml (root) — Docker Compose overlay for Prometheus, Grafana, Alertmanager, node-exporter.

7. Synthetic canary (`scripts/canary.mjs`)

Five-step journey: health check → create canary campaign → credit claimant → verify stats → cleanup. Emits trivela_canary_success, trivela_canary_duration_seconds, trivela_canary_last_run_timestamp in Prometheus text format. Designed for */5 * * * * cron or GitHub Actions schedule.
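
To make the output format concrete, here is a minimal sketch of a journey runner that emits the three canary metrics in Prometheus text format. It is not the code in `scripts/canary.mjs`: `runJourney`, `renderCanaryMetrics`, the stop-at-first-failure behaviour, and the injectable clock are assumptions, and the real script may still run cleanup after a failed step.

```javascript
// Hypothetical sketch: run named async steps in order and time the whole journey.
async function runJourney(steps, now = () => Date.now()) {
  const startedAt = now();
  let success = 1;
  let failedStep = null;
  for (const [name, step] of steps) {
    try {
      await step();
    } catch (err) {
      success = 0;
      failedStep = name;
      break; // assumption: stop at the first failing step
    }
  }
  const finishedAt = now();
  return {
    success,
    failedStep,
    durationSeconds: (finishedAt - startedAt) / 1000,
    finishedAtSeconds: Math.floor(finishedAt / 1000), // Unix time in seconds
  };
}

// Render the metrics named in this PR in Prometheus text exposition format.
function renderCanaryMetrics(result) {
  return [
    '# TYPE trivela_canary_success gauge',
    `trivela_canary_success ${result.success}`,
    '# TYPE trivela_canary_duration_seconds gauge',
    `trivela_canary_duration_seconds ${result.durationSeconds}`,
    '# TYPE trivela_canary_last_run_timestamp gauge',
    `trivela_canary_last_run_timestamp ${result.finishedAtSeconds}`,
  ].join('\n') + '\n';
}
```

The rendered text can be printed to stdout or written to a file that the node-exporter textfile collector picks up.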

8. SLO definitions (`docs/SLO.md`)

Availability (≥ 99.5% API, ≥ 99.9% reachability, ≥ 99.0% RPC), latency (p95 ≤ 1 000 ms), indexer freshness (cursor advance every 10 min), canary (success every 5 min, ≤ 30 s), operator balance (≥ 5 XLM). Error budget policy and measurement pointers.
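
For a sense of scale, the availability targets can be turned into error budgets with a little arithmetic. The sketch below assumes a 30-day rolling window, which this PR does not state; `errorBudgetMinutes` and `burnRate` are hypothetical helper names. Under that assumption a 99.5% target allows about 216 minutes of unavailability per window, 99.9% about 43 minutes, and 99.0% about 432 minutes.

```javascript
// Allowed "bad" minutes for an availability target over a window.
// The 30-day default is an assumption, not taken from docs/SLO.md.
function errorBudgetMinutes(target, windowDays = 30) {
  return (1 - target) * windowDays * 24 * 60;
}

// How fast the budget is being spent: 1 means exactly on budget,
// values above 1 mean the budget runs out before the window ends.
function burnRate(observedErrorRatio, target) {
  return observedErrorRatio / (1 - target);
}
```

As an example of the second helper, a sustained 5% error ratio (the HighBackendErrorRate threshold) against the 99.5% API target is a burn rate of roughly 10, so the whole budget would be gone in about a tenth of the window.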

9. CI (`.github/workflows/observability-ci.yml`)

promtool check rules + promtool test rules + canary syntax/dry-run + backend unit tests via turbo — runs on every monitoring/** or scripts/canary.mjs change.

Test plan

promtool test rules monitoring/alerting/alerting_rules_test.yml passes (10/10 cases)
node --check scripts/canary.mjs passes
docker compose -f compose.yaml -f compose.monitoring.yml up brings up Prometheus, Grafana, Alertmanager
Grafana at :3001 shows Trivela API and RPC dashboards auto-provisioned
/metrics endpoint includes histogram buckets and pool gauges
Sending a request that takes > REQUEST_TIMEOUT_MS returns 504 with code: REQUEST_TIMEOUT
With pool full, next acquire after acquireTimeoutMs returns 503 with code: POOL_SATURATED
SIGTERM drains in-flight requests before process exits

FinesseStudioLab#650)

Implements the full observability & runtime-reliability epic:

**RPC pool safety**
- Add acquire/release concurrency gate to rpcPool.js (maxConcurrent=10, acquireTimeoutMs=5000). Waiters over the timeout receive a typed PoolSaturatedError (code POOL_SATURATED) so the error handler can return 503 instead of letting callers hang.
- getStatus() exposes in_use/idle/waiting/healthy/unhealthy for Prometheus.

**Request deadlines**
- New backend/src/middleware/timeout.js: requestTimeout(ms) attaches an AbortSignal to req.signal, propagates client-disconnect aborts, and returns 504 REQUEST_TIMEOUT when the wall-clock deadline fires.
- Wired globally in index.js (DEFAULT_REQUEST_TIMEOUT_MS=30000, overrideable via env REQUEST_TIMEOUT_MS).

**Prometheus metrics**
- /metrics now emits a full histogram for trivela_http_request_duration_ms (buckets: 50/100/200/500/1000/2000/5000/10000/30000/+Inf) enabling histogram_quantile p50/p95/p99 in Grafana and PromQL.
- RPC pool gauges: trivela_rpc_pool_in_use, _idle, _waiting, _healthy, _unhealthy, consumed by RpcPoolSaturated and AllRpcEndpointsUnhealthy alerts.

**Graceful shutdown**
- SIGTERM/SIGINT handler in startServer(): stops accepting new connections, forces exit after SHUTDOWN_GRACE_MS (default 15 s), flushes OpenTelemetry traces via shutdownTracing(), then exits cleanly.

**Prometheus alerting rules** (monitoring/alerting/alerting_rules.yml)
- trivela_backend: HighBackendErrorRate (>5% for 5 m, critical), HighP95Latency (p95 > 1 s), BackendProcessRestart, BackendDown, AuthFailureSpike, AuthLockoutTriggered.
- trivela_rpc: AllRpcEndpointsUnhealthy, RpcPoolSaturated, RPCHealthCheckDown.
- trivela_indexer: IndexerLag (cursor stalled > 10 m).
- trivela_contracts: ContractPaused, CampaignDBWriteErrors.
- trivela_dlq: DLQGrowth.
- trivela_operator: OperatorLowBalance (< 5 XLM).
- trivela_canary: CanaryJourneyFailed, CanarySlowJourney (> 30 s).

**promtool unit tests** (monitoring/alerting/alerting_rules_test.yml)
- 10 test cases covering all primary alerts (both fire and no-fire paths).

**Alertmanager** (monitoring/alertmanager.yml)
- Routes critical→#trivela-critical + PagerDuty, contracts→#trivela-contracts, everything else→#trivela-alerts. Inhibition rules suppress child alerts when BackendDown or AllRpcEndpointsUnhealthy fires.

**Grafana dashboards-as-code** (monitoring/dashboards/)
- trivela-api.json: request rate, error rate, p50/p95/p99 latency, auth events, backend up/uptime.
- trivela-rpc-pools.json: pool saturation gauges (in-use, waiting, idle), endpoint health time series.
- Provisioning configs: grafana/provisioning/datasources/prometheus.yml and grafana/provisioning/dashboards/trivela.yml for zero-config Grafana startup.

**Synthetic canary** (scripts/canary.mjs)
- Five-step journey: health check → create canary campaign → credit claimant → verify stats → cleanup. Emits trivela_canary_success, _duration_seconds, _last_run_timestamp in Prometheus text format.
- Designed for cron/every-5-min scheduling; CANARY_METRICS_FILE writes to a file for node-exporter textfile collector instead of stdout.

**SLO definitions** (docs/SLO.md)
- Availability (99.5% API, 99.9% reachability, 99.0% RPC), latency (p95 ≤ 1 s), indexer freshness (cursor advance every 10 m), pool saturation, canary, and operator balance SLOs. Error budget policy and measurement pointers.

**CI** (.github/workflows/observability-ci.yml)
- Runs promtool check rules + promtool test rules on every monitoring/ change.
- Syntax-checks the canary script and dry-runs it with a 2 s timeout.
- Runs backend unit tests scoped to the backend package via turbo.

vercel · 2026-06-20T12:25:49Z

@CelestinaBeing is attempting to deploy a commit to the joelpeace48-cell's projects Team on Vercel.

A member of the Team first needs to authorize it.

…ilure

**Prettier formatting (3 files re-formatted)**

The format-check workflow runs prettier over `backend/src/**/*.{js,json}` and `**/*.{md,yaml,yml}`. The following files written in the observability commit did not match the repo's prettier style and needed reformatting:

- backend/src/index.js
- .github/workflows/observability-ci.yml
- docs/SLO.md
- monitoring/alerting/alerting_rules.yml
- monitoring/alerting/alerting_rules_test.yml
- monitoring/alertmanager.yml

All files now pass `prettier --check`.

**PromQL parse error in OperatorLowBalance rule**

The expression used `50_000_000` (JavaScript numeric separator syntax). PromQL does not support underscores in numeric literals, causing: `bad number or duration syntax: "50" (at line 216)`

Fix: change to `50000000` (plain integer).

**Canary JSDoc prematurely closed by `*/` in cron example**

Line 17 of scripts/canary.mjs contained: `* # */5 * * * * node scripts/canary.mjs ...`

The `*/` in the cron expression closed the surrounding `/** ... */` block comment, leaving the rest of the line as bare JS code. Node saw: `SyntaxError: Unexpected token '*'`

Fix: rewrite the scheduling note to avoid `*/` inside the block comment. `node --check scripts/canary.mjs` now passes.
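
The block-comment pitfall is easy to reproduce in isolation. The sketch below is illustrative only: the two source strings are hypothetical stand-ins, not the actual header of `scripts/canary.mjs`, and `parses` is a helper invented here that compiles a string with `new Function` without running it.

```javascript
// Returns true if `source` is syntactically valid as a function body.
function parses(source) {
  try {
    new Function(source); // compiles the body; nothing is executed
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// A cron expression inside a block comment: the star-slash ends the comment early,
// so the remainder of that line is parsed as code.
const broken = '/**\n * Schedule:\n *   # */5 * * * * node scripts/canary.mjs\n */';

// Same note moved to a line comment, where star-slash has no special meaning.
const fixed = '/**\n * Runs every five minutes from cron.\n */\n// */5 * * * * node scripts/canary.mjs';
```

Here `parses(broken)` is false and `parses(fixed)` is true, which matches what `node --check` reported before and after the fix.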

CelestinaBeing wants to merge 2 commits into FinesseStudioLab:main from CelestinaBeing:feat/observability-reliability-650
