[EPIC] Production-grade observability & runtime reliability (mainnet blocker) #650
Open
Labels
GrantFox OSS, Maybe Rewarded, Official Campaign, area: backend, difficulty: hard, enhancement, epic, infra, observability, priority: high
Why this matters (growth/scale)
A platform meant for thousands of users cannot grow on top of a system that's blind to its own failures. Reliability is a growth feature: every outage or silent failure during a high-traffic campaign launch burns user trust and operator confidence. This initiative makes Trivela observable and self-defending so it can scale without surprise outages.
Goal
Ship production-grade observability + runtime reliability: dashboards, alerts, a live canary, request deadlines, pool/saturation visibility, and graceful shutdown — wired to SLOs.
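As a rough illustration of the request-deadline part of this goal, the sketch below wraps a unit of work in a deadline and hands it an `AbortSignal` it can pass on to DB/RPC/storage calls. The names `withDeadline` and `DeadlineExceededError`, and the way the 504 status is attached to the error, are assumptions made for this example and are not taken from the Trivela codebase.

```typescript
// Hypothetical error type: carries the 504 status a route handler could map to a response.
class DeadlineExceededError extends Error {
  readonly status = 504;
  constructor(ms: number) {
    super(`deadline of ${ms}ms exceeded`);
    this.name = "DeadlineExceededError";
  }
}

// Hypothetical helper: runs `work` with a signal that aborts when the deadline
// passes or when the optional client signal (e.g. a disconnect) aborts.
async function withDeadline<T>(
  ms: number,
  work: (signal: AbortSignal) => Promise<T>,
  clientSignal?: AbortSignal,
): Promise<T> {
  const controller = new AbortController();
  const onClientAbort = () => controller.abort(clientSignal?.reason);
  if (clientSignal?.aborted) onClientAbort();
  else clientSignal?.addEventListener("abort", onClientAbort, { once: true });

  // Abort with a typed reason once the deadline elapses.
  const timer = setTimeout(() => controller.abort(new DeadlineExceededError(ms)), ms);

  // Rejects as soon as the signal aborts, so the caller never hangs on slow work.
  const aborted = new Promise<never>((_, reject) => {
    if (controller.signal.aborted) reject(controller.signal.reason);
    else {
      controller.signal.addEventListener(
        "abort",
        () => reject(controller.signal.reason),
        { once: true },
      );
    }
  });

  try {
    return await Promise.race([work(controller.signal), aborted]);
  } finally {
    clearTimeout(timer);
    clientSignal?.removeEventListener("abort", onClientAbort);
  }
}
```

In an Express handler, a wrapper like this would presumably be called with a per-route budget and a signal tied to the request's `close` event; the work function is responsible for forwarding the signal to whatever client libraries it calls.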
Scope (merged work items)
- […] `observability/dashboards/*`) for API, DB/pools, RPC/breaker, and indexer; provisioned + reproducible. (was feat: Grafana dashboards as code (API, DB, RPC, indexer, KPIs) #577, feat: Indexer observability dashboard (lag, throughput, errors) #563)
- […] `promtool` tests + routing. (was feat: Prometheus alerting rules (latency, errors, RPC, indexer lag, pools) #576)
- […] `register→credit→claim` canary on testnet + uptime checks; alert within minutes on failure (also flags stale README contract IDs). (was feat: Synthetic uptime + register→credit→claim canary monitoring #579)
- […] `AbortSignal` propagation to DB/RPC/storage; abort on client disconnect; 504 on deadline. (was feat: Per-route request timeouts, deadlines & cancellation propagation #570)
- […] `pool_in_use/idle/waiting` metrics; acquire timeouts; typed `503 POOL_SATURATED` instead of hanging. (was feat: Connection-pool saturation metrics & fast-fail safeguards #571)
Acceptance criteria
- […] `promtool` tests pass).
Verification
`promtool test rules` in CI; force a canary failure; load test driving pool saturation; SIGTERM-under-load deploy test.
Priority: high · Difficulty: hard · Effort: L · mainnet blocker
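The pool-saturation item in the scope (in-use/idle/waiting metrics, acquire timeouts, and a typed `503 POOL_SATURATED` instead of hanging) might look roughly like the sketch below. Only the metric names and the `POOL_SATURATED` code come from this issue; `BoundedPool`, `PoolSaturatedError`, and the `stats()` shape are hypothetical, and a real implementation would more likely wrap the existing DB driver's pool than replace it.

```typescript
// Hypothetical typed error: lets an error handler answer 503 POOL_SATURATED.
class PoolSaturatedError extends Error {
  readonly status = 503;
  readonly code = "POOL_SATURATED";
  constructor(waitedMs: number) {
    super(`no connection available after ${waitedMs}ms`);
    this.name = "PoolSaturatedError";
  }
}

interface PoolStats {
  pool_in_use: number;
  pool_idle: number;
  pool_waiting: number;
}

// Hypothetical fixed-size pool that fails fast when an acquire waits too long.
class BoundedPool<T> {
  private idle: T[];
  private inUse = 0;
  private waiters: Array<(resource: T) => void> = [];
  private acquireTimeoutMs: number;

  constructor(resources: T[], acquireTimeoutMs: number) {
    this.idle = [...resources];
    this.acquireTimeoutMs = acquireTimeoutMs;
  }

  acquire(): Promise<T> {
    const free = this.idle.pop();
    if (free !== undefined) {
      this.inUse++;
      return Promise.resolve(free);
    }
    // Nothing idle: queue the caller, but reject once the acquire timeout passes.
    return new Promise<T>((resolve, reject) => {
      const waiter = (resource: T) => {
        clearTimeout(timer);
        resolve(resource);
      };
      const timer = setTimeout(() => {
        const i = this.waiters.indexOf(waiter);
        if (i >= 0) this.waiters.splice(i, 1);
        reject(new PoolSaturatedError(this.acquireTimeoutMs));
      }, this.acquireTimeoutMs);
      this.waiters.push(waiter);
    });
  }

  release(resource: T): void {
    const next = this.waiters.shift();
    if (next) {
      next(resource); // handed straight to a waiter, so inUse is unchanged
    } else {
      this.inUse--;
      this.idle.push(resource);
    }
  }

  // Snapshot suitable for exporting as gauges.
  stats(): PoolStats {
    return {
      pool_in_use: this.inUse,
      pool_idle: this.idle.length,
      pool_waiting: this.waiters.length,
    };
  }
}
```

A load test that drives saturation would then expect `pool_waiting` to rise and requests to fail with the typed error after the acquire timeout, rather than queueing indefinitely.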