feat(observability): production-grade metrics, alerting, synthetic canary & SLOs (#650)#665

Open
CelestinaBeing wants to merge 2 commits into FinesseStudioLab:main from CelestinaBeing:feat/observability-reliability-650

@CelestinaBeing

Summary

Closes #650. This PR delivers the full observability & runtime-reliability epic across seven deliverable areas: RPC pool saturation safety, per-request deadlines, Prometheus metrics + histograms, graceful shutdown, Prometheus alerting rules, Grafana dashboards-as-code, and a synthetic canary with SLO definitions.


What changed

1. RPC pool saturation safety (backend/src/rpcPool.js)

  • Added acquire() / release() concurrency gate with configurable maxConcurrent (default 10) and acquireTimeoutMs (default 5 000 ms).
  • Callers that wait past the timeout receive a typed PoolSaturatedError (code: 'POOL_SATURATED') rather than hanging indefinitely.
  • New getStatus() method exposes in_use, idle, waiting, healthy, unhealthy for the /metrics endpoint.
  • errorHandler.js maps POOL_SATURATED → HTTP 503 with { error: 'Service temporarily unavailable', code: 'POOL_SATURATED' }.
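
A minimal sketch of how such a gate could behave, assuming a FIFO wait queue and direct slot hand-off on release. The class name `ConcurrencyGate` and the helper `poolErrorToResponse` are hypothetical; this is not the code in rpcPool.js, and the endpoint-health fields (`healthy`, `unhealthy`) are left out.

```javascript
// Illustrative only: a concurrency gate with a bounded wait.
class PoolSaturatedError extends Error {
  constructor(message = 'RPC pool saturated') {
    super(message);
    this.name = 'PoolSaturatedError';
    this.code = 'POOL_SATURATED';
  }
}

class ConcurrencyGate {
  constructor({ maxConcurrent = 10, acquireTimeoutMs = 5000 } = {}) {
    this.maxConcurrent = maxConcurrent;
    this.acquireTimeoutMs = acquireTimeoutMs;
    this.inUse = 0;
    this.waiters = []; // FIFO queue of pending acquire() calls
  }

  acquire() {
    if (this.inUse < this.maxConcurrent) {
      this.inUse += 1;
      return Promise.resolve();
    }
    return new Promise((resolve, reject) => {
      const waiter = { resolve, timer: null };
      waiter.timer = setTimeout(() => {
        // Waited too long: leave the queue and fail with the typed error.
        const i = this.waiters.indexOf(waiter);
        if (i !== -1) this.waiters.splice(i, 1);
        reject(new PoolSaturatedError());
      }, this.acquireTimeoutMs);
      this.waiters.push(waiter);
    });
  }

  release() {
    const next = this.waiters.shift();
    if (next) {
      // Hand the slot straight to the oldest waiter; inUse stays the same.
      clearTimeout(next.timer);
      next.resolve();
    } else if (this.inUse > 0) {
      this.inUse -= 1;
    }
  }

  getStatus() {
    return {
      in_use: this.inUse,
      idle: this.maxConcurrent - this.inUse,
      waiting: this.waiters.length,
    };
  }
}

// Hypothetical helper showing the error-to-response mapping described above.
function poolErrorToResponse(err) {
  if (err && err.code === 'POOL_SATURATED') {
    return {
      status: 503,
      body: { error: 'Service temporarily unavailable', code: 'POOL_SATURATED' },
    };
  }
  return null;
}
```

Callers would typically pair `await gate.acquire()` with `gate.release()` in a `finally` block so a failed RPC call cannot leak a slot.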

2. Per-request deadlines (backend/src/middleware/timeout.js)

  • requestTimeout(ms) middleware attaches an AbortController signal to req.signal so downstream handlers (RPC calls, DB queries) can honour cancellation.
  • Client disconnects cancel the signal immediately (no wasted work).
  • Wall-clock breach returns 504 REQUEST_TIMEOUT before Express's default error boundary fires.
  • Wired globally in index.js at DEFAULT_REQUEST_TIMEOUT_MS = 30_000 (overridable via REQUEST_TIMEOUT_MS env var).
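
The deadline behaviour above can be sketched roughly as follows. This is an illustration using the standard Express middleware signature, not the contents of timeout.js; the `error` message text in the 504 body is an assumption.

```javascript
// Illustrative only: per-request deadline with cancellation on disconnect.
function requestTimeout(ms) {
  return function timeoutMiddleware(req, res, next) {
    const controller = new AbortController();
    req.signal = controller.signal; // downstream handlers can pass this on

    const timer = setTimeout(() => {
      // Wall-clock deadline hit: cancel downstream work and answer 504.
      controller.abort();
      if (!res.headersSent) {
        res.status(504).json({ error: 'Request timed out', code: 'REQUEST_TIMEOUT' });
      }
    }, ms);

    // Normal completion: disarm the deadline.
    res.on('finish', () => clearTimeout(timer));

    // Connection closed: if the response was not fully written, the client
    // went away, so abort immediately instead of doing wasted work.
    res.on('close', () => {
      clearTimeout(timer);
      if (!res.writableFinished) controller.abort();
    });

    next();
  };
}
```

A handler could then forward the signal, for example `fetch(url, { signal: req.signal })`, so the outbound call is cancelled together with the request.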

3. Prometheus metrics + latency histogram (backend/src/index.js)

/metrics now emits:

| Metric | Type | Description |
| --- | --- | --- |
| trivela_http_request_duration_ms_bucket{le="N"} | histogram | Request duration in 9 finite buckets (50 to 30 000 ms) plus +Inf, enabling histogram_quantile |
| trivela_http_request_duration_ms_sum | counter | Sum of observed request durations (ms) |
| trivela_http_request_duration_ms_count | counter | Number of observed requests |
| trivela_rpc_pool_in_use | gauge | Active pool slots |
| trivela_rpc_pool_idle | gauge | Available pool slots |
| trivela_rpc_pool_waiting | gauge | Callers blocked on acquire |
| trivela_rpc_pool_healthy | gauge | Reachable RPC endpoints |
| trivela_rpc_pool_unhealthy | gauge | Quarantined RPC endpoints |

Latency is recorded in a res.on('finish') hook on every request.
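
For illustration, cumulative buckets and the text exposition could be produced along these lines. The helper names (`createHistogram`, `observe`, `render`, `latencyMiddleware`) are hypothetical and the actual code in index.js may be organised differently; the bucket bounds are the ones listed in this PR.

```javascript
// Illustrative only: a hand-rolled cumulative histogram.
const BUCKETS_MS = [50, 100, 200, 500, 1000, 2000, 5000, 10000, 30000];

function createHistogram(bounds = BUCKETS_MS) {
  return { bounds, counts: new Array(bounds.length).fill(0), sum: 0, count: 0 };
}

function observe(hist, valueMs) {
  // Prometheus buckets are cumulative: every bucket whose upper bound
  // is >= the observed value is incremented.
  hist.bounds.forEach((le, i) => {
    if (valueMs <= le) hist.counts[i] += 1;
  });
  hist.sum += valueMs;
  hist.count += 1;
}

function render(name, hist) {
  const lines = [`# TYPE ${name} histogram`];
  hist.bounds.forEach((le, i) => {
    lines.push(`${name}_bucket{le="${le}"} ${hist.counts[i]}`);
  });
  lines.push(`${name}_bucket{le="+Inf"} ${hist.count}`); // +Inf equals the total count
  lines.push(`${name}_sum ${hist.sum}`);
  lines.push(`${name}_count ${hist.count}`);
  return lines.join('\n') + '\n';
}

// Express-style middleware that records latency when the response finishes.
function latencyMiddleware(hist) {
  return (req, res, next) => {
    const start = process.hrtime.bigint();
    res.on('finish', () => {
      observe(hist, Number(process.hrtime.bigint() - start) / 1e6); // ns to ms
    });
    next();
  };
}
```

Keeping the buckets cumulative at observation time means rendering is a straight dump of the counters, which is what `histogram_quantile` expects to read.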

4. Graceful shutdown (backend/src/index.js)

  • SIGTERM / SIGINT handlers call gracefulShutdown(signal).
  • Sequence: stop accepting connections → force exit after SHUTDOWN_GRACE_MS (default 15 s) if draining has not completed → flush OpenTelemetry via shutdownTracing() → process.exit(0).
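
One way to arrange that sequence is sketched below. The factory `createGracefulShutdown` and its injected `exit` parameter are hypothetical (they make the flow testable), and exit code 1 for the forced path is an assumption; only `gracefulShutdown(signal)`, `shutdownTracing()` and the 15 s default come from this PR.

```javascript
// Illustrative only: drain, flush traces, then exit; force exit on overrun.
function createGracefulShutdown({ server, shutdownTracing, graceMs = 15000, exit = process.exit }) {
  let shuttingDown = false;

  return async function gracefulShutdown(signal) {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;
    console.error(`[shutdown] received ${signal}`);

    // Safety net: if draining or flushing hangs, exit anyway.
    const forceTimer = setTimeout(() => exit(1), graceMs);
    forceTimer.unref(); // do not let this timer keep the process alive

    // server.close() stops accepting new connections and calls back
    // once in-flight connections have ended.
    await new Promise((resolve) => server.close(resolve));

    try {
      await shutdownTracing();
    } catch (err) {
      console.error('[shutdown] trace flush failed', err);
    }

    clearTimeout(forceTimer);
    exit(0);
  };
}
```

Wiring it up would look like `process.once('SIGTERM', () => gracefulShutdown('SIGTERM'))`, with the same registration for `SIGINT`.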

5. Prometheus alerting rules (monitoring/alerting/alerting_rules.yml)

16 alert rules across 7 groups covering all 7 epic requirements:
HighBackendErrorRate, HighP95Latency, BackendDown, BackendProcessRestart, AuthFailureSpike, AuthLockoutTriggered, AllRpcEndpointsUnhealthy, RpcPoolSaturated, RPCHealthCheckDown, IndexerLag, ContractPaused, CampaignDBWriteErrors, DLQGrowth, OperatorLowBalance, CanaryJourneyFailed, CanarySlowJourney.

promtool unit tests (monitoring/alerting/alerting_rules_test.yml) cover 10 scenarios (both fire and no-fire paths).

6. Grafana dashboards-as-code (monitoring/dashboards/)

  • trivela-api.json — request rate, error rate, p50/p95/p99 latency, auth events, backend up/uptime.
  • trivela-rpc-pools.json — pool saturation gauges with colour thresholds, endpoint health series.
  • Grafana provisioning configs in monitoring/grafana/provisioning/ wire datasource + dashboards on container startup (zero manual import).
  • monitoring/alertmanager.yml — routes critical to #trivela-critical + PagerDuty, inhibits child alerts on parent failures.
  • compose.monitoring.yml (root) — Docker Compose overlay for Prometheus, Grafana, Alertmanager, node-exporter.

7. Synthetic canary (scripts/canary.mjs)

Five-step journey: health check → create canary campaign → credit claimant → verify stats → cleanup. Emits trivela_canary_success, trivela_canary_duration_seconds, trivela_canary_last_run_timestamp in Prometheus text format. Designed for */5 * * * * cron or GitHub Actions schedule.
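
As a rough illustration of the output side, the three canary series could be rendered like this. The function names and the choice of Unix seconds for the timestamp are assumptions; canary.mjs may structure this differently.

```javascript
// Illustrative only: run journey steps in order and format the result.
async function runJourney(steps, now = Date.now) {
  const startedAtMs = now();
  let success = true;
  for (const step of steps) {
    try {
      await step();
    } catch {
      success = false; // stop at the first failing step
      break;
    }
  }
  const finishedAtMs = now();
  return { success, durationSeconds: (finishedAtMs - startedAtMs) / 1000, finishedAtMs };
}

function formatCanaryMetrics({ success, durationSeconds, finishedAtMs }) {
  return [
    '# TYPE trivela_canary_success gauge',
    `trivela_canary_success ${success ? 1 : 0}`,
    '# TYPE trivela_canary_duration_seconds gauge',
    `trivela_canary_duration_seconds ${durationSeconds}`,
    '# TYPE trivela_canary_last_run_timestamp gauge',
    // Assumption: Unix time in seconds, the usual Prometheus convention.
    `trivela_canary_last_run_timestamp ${Math.floor(finishedAtMs / 1000)}`,
  ].join('\n') + '\n';
}
```

A real journey would likely run its cleanup step even after a failure so that canary campaigns do not accumulate; this sketch simply stops at the first error.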

8. SLO definitions (docs/SLO.md)

Availability (≥ 99.5% API, ≥ 99.9% reachability, ≥ 99.0% RPC), latency (p95 ≤ 1 000 ms), indexer freshness (cursor advance every 10 min), canary (success every 5 min, ≤ 30 s), operator balance (≥ 5 XLM). Error budget policy and measurement pointers.
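
To make the latency SLO concrete, the sketch below estimates a quantile from cumulative bucket counts using linear interpolation inside the matching bucket, which is the approach `histogram_quantile` takes for classic histograms. The function `estimateQuantile` is hypothetical, works on a single snapshot for 0 < q ≤ 1 (PromQL would normally apply it to `rate()` of the buckets), and expects at least one finite bucket followed by +Inf.

```javascript
// Illustrative only: quantile estimate from cumulative histogram buckets.
// buckets: [{ le, count }] sorted by le ascending, last entry le = Infinity.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);

  // Quantile falls in the +Inf bucket: report the highest finite bound.
  if (buckets[i].le === Infinity) return buckets[buckets.length - 2].le;

  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;

  // Assume observations are spread evenly across the bucket.
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

// Made-up snapshot: 90 requests <= 500 ms, 96 <= 1000 ms, all 100 <= 2000 ms.
const snapshot = [
  { le: 500, count: 90 },
  { le: 1000, count: 96 },
  { le: 2000, count: 100 },
  { le: Infinity, count: 100 },
];
const p95 = estimateQuantile(0.95, snapshot); // about 916.7 ms, within the 1 000 ms target
```

Because the estimate interpolates within a bucket, its accuracy depends on bucket width around the SLO threshold; having a bucket boundary exactly at 1 000 ms keeps the p95 check unambiguous.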

9. CI (.github/workflows/observability-ci.yml)

promtool check rules + promtool test rules + canary syntax/dry-run + backend unit tests via turbo — runs on every monitoring/** or scripts/canary.mjs change.


Test plan

  • promtool test rules monitoring/alerting/alerting_rules_test.yml passes (10/10 cases)
  • node --check scripts/canary.mjs passes
  • docker compose -f compose.yaml -f compose.monitoring.yml up brings up Prometheus, Grafana, Alertmanager
  • Grafana at :3001 shows Trivela API and RPC dashboards auto-provisioned
  • /metrics endpoint includes histogram buckets and pool gauges
  • Sending a request that takes > REQUEST_TIMEOUT_MS returns 504 with code: REQUEST_TIMEOUT
  • With pool full, next acquire after acquireTimeoutMs returns 503 with code: POOL_SATURATED
  • SIGTERM drains in-flight requests before process exits

FinesseStudioLab#650)

Implements the full observability & runtime-reliability epic:

**RPC pool safety**
- Add acquire/release concurrency gate to rpcPool.js (maxConcurrent=10,
  acquireTimeoutMs=5000). Waiters over the timeout receive a typed
  PoolSaturatedError (code POOL_SATURATED) so the error handler can
  return 503 instead of letting callers hang.
- getStatus() exposes in_use/idle/waiting/healthy/unhealthy for Prometheus.

**Request deadlines**
- New backend/src/middleware/timeout.js: requestTimeout(ms) attaches an
  AbortSignal to req.signal, propagates client-disconnect aborts, and
  returns 504 REQUEST_TIMEOUT when the wall-clock deadline fires.
- Wired globally in index.js (DEFAULT_REQUEST_TIMEOUT_MS=30000, overridable
  via env REQUEST_TIMEOUT_MS).

**Prometheus metrics**
- /metrics now emits a full histogram for trivela_http_request_duration_ms
  (buckets: 50/100/200/500/1000/2000/5000/10000/30000/+Inf) enabling
  histogram_quantile p50/p95/p99 in Grafana and PromQL.
- RPC pool gauges: trivela_rpc_pool_in_use, _idle, _waiting, _healthy,
  _unhealthy — consumed by RpcPoolSaturated and AllRpcEndpointsUnhealthy alerts.

**Graceful shutdown**
- SIGTERM/SIGINT handler in startServer(): stops accepting new connections,
  forces exit after SHUTDOWN_GRACE_MS (default 15 s), flushes OpenTelemetry
  traces via shutdownTracing(), then exits cleanly.

**Prometheus alerting rules** (monitoring/alerting/alerting_rules.yml)
- trivela_backend: HighBackendErrorRate (>5% for 5 m, critical), HighP95Latency
  (p95 > 1 s), BackendProcessRestart, BackendDown, AuthFailureSpike,
  AuthLockoutTriggered.
- trivela_rpc: AllRpcEndpointsUnhealthy, RpcPoolSaturated, RPCHealthCheckDown.
- trivela_indexer: IndexerLag (cursor stalled > 10 m).
- trivela_contracts: ContractPaused, CampaignDBWriteErrors.
- trivela_dlq: DLQGrowth.
- trivela_operator: OperatorLowBalance (< 5 XLM).
- trivela_canary: CanaryJourneyFailed, CanarySlowJourney (> 30 s).

**promtool unit tests** (monitoring/alerting/alerting_rules_test.yml)
- 10 test cases covering all primary alerts (both fire and no-fire paths).

**Alertmanager** (monitoring/alertmanager.yml)
- Routes critical→#trivela-critical + PagerDuty, contracts→#trivela-contracts,
  everything else→#trivela-alerts. Inhibition rules suppress child alerts when
  BackendDown or AllRpcEndpointsUnhealthy fires.

**Grafana dashboards-as-code** (monitoring/dashboards/)
- trivela-api.json: request rate, error rate, p50/p95/p99 latency, auth events,
  backend up/uptime.
- trivela-rpc-pools.json: pool saturation gauges (in-use, waiting, idle),
  endpoint health time series.
- Provisioning configs: grafana/provisioning/datasources/prometheus.yml and
  grafana/provisioning/dashboards/trivela.yml for zero-config Grafana startup.

**Synthetic canary** (scripts/canary.mjs)
- Five-step journey: health check → create canary campaign → credit claimant →
  verify stats → cleanup. Emits trivela_canary_success, _duration_seconds,
  _last_run_timestamp in Prometheus text format.
- Designed for cron/every-5-min scheduling; CANARY_METRICS_FILE writes to a
  file for node-exporter textfile collector instead of stdout.

**SLO definitions** (docs/SLO.md)
- Availability (99.5% API, 99.9% reachability, 99.0% RPC), latency (p95 ≤ 1 s),
  indexer freshness (cursor advance every 10 m), pool saturation, canary, and
  operator balance SLOs. Error budget policy and measurement pointers.

**CI** (.github/workflows/observability-ci.yml)
- Runs promtool check rules + promtool test rules on every monitoring/ change.
- Syntax-checks the canary script and dry-runs it with a 2 s timeout.
- Runs backend unit tests scoped to the backend package via turbo.
@vercel

vercel Bot commented Jun 20, 2026


@CelestinaBeing is attempting to deploy a commit to the joelpeace48-cell's projects Team on Vercel.

A member of the Team first needs to authorize it.

…ilure

**Prettier formatting (6 files re-formatted)**
The format-check workflow runs prettier over backend/src/**/*.{js,json} and
**/*.{md,yaml,yml}. The following files written in the observability commit
did not match the repo's prettier style and needed reformatting:
  - backend/src/index.js
  - .github/workflows/observability-ci.yml
  - docs/SLO.md
  - monitoring/alerting/alerting_rules.yml
  - monitoring/alerting/alerting_rules_test.yml
  - monitoring/alertmanager.yml
All files now pass `prettier --check`.

**PromQL parse error in OperatorLowBalance rule**
The expression used `50_000_000` (JavaScript numeric separator syntax).
The PromQL parser used by promtool here rejected the underscore-separated literal, causing:

  bad number or duration syntax: "50" (at line 216)

Fix: change to `50000000` (plain integer).

**Canary JSDoc prematurely closed by `*/` in cron example**
Line 17 of scripts/canary.mjs contained:
  *   # */5 * * * *  node scripts/canary.mjs ...
The `*/` in the cron expression closed the surrounding `/** ... */` block
comment, leaving the rest of the line as bare JS code. Node saw:
  SyntaxError: Unexpected token '*'

Fix: rewrite the scheduling note to avoid `*/` inside the block comment.
`node --check scripts/canary.mjs` now passes.
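
For reference, one way to keep a cron expression out of a block comment is sketched here. The constant and function names are hypothetical; the actual rewrite in canary.mjs may simply reword the note.

```javascript
// Line comments end at the newline, so a cron expression containing "*/"
// is harmless here:
//   */5 * * * *  node scripts/canary.mjs
//
// Inside a /** ... */ block the same text would close the comment early.
// Keeping the expression in a string constant sidesteps the problem.
const CANARY_CRON_SCHEDULE = '*/5 * * * *'; // hypothetical constant name

/**
 * Scheduling: run every five minutes (see CANARY_CRON_SCHEDULE).
 */
function describeSchedule() {
  return `cron: ${CANARY_CRON_SCHEDULE}`;
}
```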
Development

Successfully merging this pull request may close these issues.

[EPIC] Production-grade observability & runtime reliability (mainnet blocker)

1 participant