Add detailed health endpoint with raindex sync status#95
Add detailed health endpoint with raindex sync status#95alastairong1 wants to merge 1 commit intomainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughA new detailed health endpoint GET /health/detailed was added that returns structured app DB and raindex sync status. New typed health models were introduced. OpenAPI registration was updated to include the new route. The rain.orderbook submodule and related test fixtures/bindings were bumped for compatibility; swap code updated to match the bumped API. ChangesDetailed Health Endpoint Implementation
Library Compatibility & Fixtures
Sequence DiagramsequenceDiagram
actor Client
participant Handler as GET /health/detailed
participant DbPool as App DbPool
participant RaindexProvider as SharedRaindex Provider
participant StatusLogic as Status Computation
Client->>Handler: GET /health/detailed
Handler->>DbPool: SELECT 1 (connectivity check)
DbPool-->>Handler: ok / error
Handler->>RaindexProvider: read LocalDbSyncSnapshot
RaindexProvider-->>Handler: snapshot / error
Handler->>StatusLogic: map_raindex_snapshot(), assemble DbStatus
StatusLogic-->>Handler: overall status (ok / degraded / error)
Handler-->>Client: DetailedHealthResponse { status, app_db, raindex }
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/routes/health.rs`:
- Around line 56-70: Before returning the DetailedHealthResponse, add a tracing
info log that records the computed status and the component booleans/metrics
(e.g., app_db.connected, raindex_db.connected, and raindex_db.orderbooks.len())
so the decision context is observable; place this log immediately after the
status is determined (referencing the local variable status and the locals
app_db and raindex_db) and before the Ok(Json(DetailedHealthResponse { ... }))
return, using tracing::info! with a clear message and structured fields.
- Around line 211-213: The code is directly casting a potentially negative DB
value `block` to u64 in the `Ok((block, updated_at))` match arm which can
produce an invalid block number; update that match arm (where
`info.last_synced_block` is set) to guard against negative values by checking
`if block < 0` and in that case set `info.last_synced_block` to a safe value
(e.g., 0) or handle it as None, otherwise cast `block as u64`; also consider
emitting a warning/log when a negative value is observed so the issue can be
traced, leaving `info.updated_at` assignment unchanged.
- Around line 76-87: The health endpoint is leaking internal errors and
filesystem paths: update check_app_db (and any other health helpers that
construct DbStatus and include error strings or db_path) to never return raw
e.to_string() or filesystem paths to clients; instead set DbStatus.error to a
generic message like "database unavailable" or None and remove/omit db_path from
the returned health payload, while still logging the full error and path
internally with tracing::warn!(error=%e, db_path=%path) so operators see details
but clients only get non-sensitive, generic status values.
- Around line 223-231: Per-orderbook SQL errors are currently swallowed in the
Err(e) arm where you only call tracing::warn! and then continue returning
Ok(results), causing connected: true to be reported despite failures; update the
error handling in the match that inspects sync_status (the branch that currently
logs tracing::warn! with error = %e, chain_id and orderbook = %ob_address) to
propagate or surface the error instead of silently continuing — either return an
Err from that function or set the per-orderbook health flag to false and include
that in results so overall connected becomes false; ensure the same change is
applied to the other identical error branches around the sync_status handling
(the other tracing::warn! places) so SQL errors no longer result in a falsely
healthy raindex DB.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a00073f4-d67a-4fc2-b4b2-ce299d61cf5f
📒 Files selected for processing (3)
src/main.rssrc/routes/health.rssrc/types/health.rs
- docs/ops.md: journalctl + curl recipes for the deployed service, cache_warmer status interpretation, smoke-test usage - scripts/smoke.sh: end-to-end correctness + latency probe against the API, asserts on response shape (not just status), reports per-check latency - scripts/uptimerobot-setup.sh: one-shot creation of the 3 baseline UptimeRobot monitors (liveness, /health/detailed status=ok, cache_warmer running=true) Notes: - The smoke + UptimeRobot scripts reference /health/detailed.cache_warmer which only lights up once the detailed-health (#95) and cache-warmer PRs land. They will return FAIL/keyword-miss against current main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
How to use the Graphite Merge QueueAdd the label add-to-gt-merge-queue to this PR to add it to the merge queue. You must have a Graphite account in order to use the merge queue. Sign up using this link. An organization admin has enabled the Graphite Merge Queue in this repository. Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue. This stack of pull requests is managed by Graphite. Learn more about stacking. |
8b055a6 to
2d85560
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/routes/health.rs (1)
50-64: ⚡ Quick winRun the two health probes in parallel.
check_app_db()andcheck_raindex_db()are independent, but this handler awaits them sequentially. That makes every/health/detailedscrape pay the sum of both latencies instead of just the slower check.♻️ Proposed fix
- // 1. Check app database connectivity - let app_db = check_app_db(pool).await; - - // 2. Check raindex database and sync status - let raindex_db = check_raindex_db(shared_raindex).await; + // Check both subsystems concurrently. + let (app_db, raindex_db) = tokio::join!( + check_app_db(pool), + check_raindex_db(shared_raindex) + );🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/routes/health.rs` around lines 50 - 64, Run the two independent probes concurrently instead of awaiting sequentially: invoke check_app_db(pool) and check_raindex_db(shared_raindex) without awaiting, then await them together (e.g., using tokio::join! or futures::join) so you get app_db and raindex_db results in parallel; afterwards use the existing status logic that inspects app_db.connected, raindex_db.connected, and raindex_db.orderbooks to compute "ok"/"degraded"/"error".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/routes/health.rs`:
- Around line 50-64: Run the two independent probes concurrently instead of
awaiting sequentially: invoke check_app_db(pool) and
check_raindex_db(shared_raindex) without awaiting, then await them together
(e.g., using tokio::join! or futures::join) so you get app_db and raindex_db
results in parallel; afterwards use the existing status logic that inspects
app_db.connected, raindex_db.connected, and raindex_db.orderbooks to compute
"ok"/"degraded"/"error".
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 52aaf6ef-fb21-4fa4-8d65-8d183b4e6bd9
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (4)
Cargo.tomlsrc/main.rssrc/routes/health.rssrc/types/health.rs
✅ Files skipped from review due to trivial changes (2)
- src/main.rs
- Cargo.toml
🚧 Files skipped from review as they are similar to previous changes (1)
- src/types/health.rs
07df385 to
5f14c7d
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/routes/health.rs (1)
85-91:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSanitize client-facing error strings in health payloads
Both app DB and raindex branches return raw internal errors to clients (
e.to_string()/ interpolated{e}). Keep full details in logs, but return generic messages in the JSON payload.Suggested hardening patch
Err(e) => { tracing::warn!(error = %e, "app database health check failed"); DbStatus { status: STATUS_ERROR.to_string(), connected: false, - error: Some(e.to_string()), + error: Some("database unavailable".to_string()), } } @@ Err(e) => { tracing::warn!(error = %e, "failed to get raindex local db sync snapshot"); RaindexSyncStatus { status: STATUS_ERROR.to_string(), configured: false, healthy: false, scheduler_state: scheduler_state_label(SchedulerState::Leader).to_string(), - error: Some(format!("failed to get raindex local db sync snapshot: {e}")), + error: Some("raindex local db sync snapshot unavailable".to_string()), networks: vec![], orderbooks: vec![], } }Also applies to: 100-108
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/routes/health.rs` around lines 85 - 91, The health endpoint currently returns raw internal errors to clients (e.g., DbStatus { error: Some(e.to_string()) }) — keep full error details in logs (tracing::warn!(error = %e, ...)) but replace the client-facing error payload with a generic message; update both branches that construct DbStatus on Err (the app DB branch that builds DbStatus with STATUS_ERROR and the raindex branch around the similar construction) to set error to Some("internal server error") or None (your standard generic health error string) instead of e.to_string(), leaving tracing::warn! unchanged so logs still contain %e.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/types/health.rs`:
- Around line 100-146: Add a new field to the OrderbookSyncInfo struct named
something like latest_trade_age_seconds: Option<u64> (or latest_trade_age_ms
depending on project convention) with appropriate serde skip_serializing_if and
a schema/example annotation, then populate this field where OrderbookSyncInfo
instances are constructed (update the mapping function that builds
OrderbookSyncInfo from internal orderbook state / DB to calculate and assign the
latest trade age), and finally extend the /health/detailed tests to assert the
new field is present and correct for each orderbook; refer to OrderbookSyncInfo
and the mapping function that returns OrderbookSyncInfo instances and the
existing health/detailed test files to locate placements.
---
Duplicate comments:
In `@src/routes/health.rs`:
- Around line 85-91: The health endpoint currently returns raw internal errors
to clients (e.g., DbStatus { error: Some(e.to_string()) }) — keep full error
details in logs (tracing::warn!(error = %e, ...)) but replace the client-facing
error payload with a generic message; update both branches that construct
DbStatus on Err (the app DB branch that builds DbStatus with STATUS_ERROR and
the raindex branch around the similar construction) to set error to
Some("internal server error") or None (your standard generic health error
string) instead of e.to_string(), leaving tracing::warn! unchanged so logs still
contain %e.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4d2d7dd1-2837-463c-9151-586fadc38a8a
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (8)
lib/rain.orderbooksrc/main.rssrc/routes/health.rssrc/routes/order/mod.rssrc/routes/swap/mod.rssrc/routes/tokens.rssrc/test_helpers.rssrc/types/health.rs
✅ Files skipped from review due to trivial changes (3)
- lib/rain.orderbook
- src/routes/order/mod.rs
- src/main.rs
5f14c7d to
4d69f7b
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/types/health.rs (1)
100-146:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift
OrderbookSyncInfois still missing thelatest_trade_age_secondsfield.The PR objectives explicitly list "latest trade age" per orderbook as a core diagnostic for detecting sync stalls (e.g. the Apr 17 stoppage). The field is absent from the struct and not populated in
map_orderbook_snapshot. The test at line 288 also has no assertion for it.📋 Suggested struct addition
/// Timestamp when raindex last updated the persisted sync status. #[serde(skip_serializing_if = "Option::is_none")] pub updated_at: Option<String>, + + /// Age of the most recent trade for this orderbook, in seconds. + /// None if no trades have been recorded yet. + #[schema(example = 42)] + #[serde(skip_serializing_if = "Option::is_none")] + pub latest_trade_age_seconds: Option<u64>, /// Error message if this orderbook failed. #[serde(skip_serializing_if = "Option::is_none")] pub error: Option<String>,Populate this in
map_orderbook_snapshotand extend the snapshot test to assert the value.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/types/health.rs` around lines 100 - 146, The OrderbookSyncInfo struct is missing the latest_trade_age_seconds field required by the PR; add pub latest_trade_age_seconds: Option<u64> (with #[serde(skip_serializing_if = "Option::is_none")] and a schema/example annotation) to OrderbookSyncInfo, then compute and populate that field in map_orderbook_snapshot (use the latest trade timestamp to compute age in seconds or None if no trades), and finally update the snapshot test (the test referencing map_orderbook_snapshot) to assert the expected latest_trade_age_seconds value is present.src/routes/health.rs (1)
86-93:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRaw DB error string still exposed to clients in
check_app_db.Line 91 passes
e.to_string()directly into the response body. Same pattern appears at line 108 incheck_raindex_db. Sanitised error strings should go to the client; the full error should go to the tracing log.🔒 Suggested fix for check_app_db
Err(e) => { - tracing::warn!(error = %e, "app database health check failed"); + tracing::warn!(error = %e, "app database health check failed"); // full error retained in logs DbStatus { status: STATUS_ERROR.to_string(), connected: false, - error: Some(e.to_string()), + error: Some("database unavailable".into()), } }🔒 Suggested fix for check_raindex_db error arm (line 108)
- error: Some(format!("failed to get raindex local db sync snapshot: {e}")), + error: Some("failed to get raindex sync snapshot".into()),Log
{e}in the existingtracing::warn!at line 102 (already done) and keep the client-facing message generic.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/routes/health.rs` around lines 86 - 93, In check_app_db and check_raindex_db update the error arms so you do not forward the raw e.to_string() to clients: keep the existing tracing::warn!(error = %e, ...) to log the full error, but change the DbStatus.error field to a sanitized/generic message (e.g. "database error" or the error kind) instead of e.to_string(); modify the error arms that construct DbStatus (in check_app_db and check_raindex_db) to supply the generic string while leaving the trace log unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/routes/health.rs`:
- Around line 101-113: The error branch for get_local_db_sync_snapshot currently
hardcodes scheduler_state_label(SchedulerState::Leader) which misrepresents the
real state; change the error path to report an unknown scheduler state by either
(a) adding an Unknown variant to the SchedulerState enum and a corresponding arm
in scheduler_state_label, then use
scheduler_state_label(SchedulerState::Unknown) in the Err(e) branch, or (b) if
an Unknown variant already exists, replace SchedulerState::Leader with
SchedulerState::Unknown in the RaindexSyncStatus construction (symbols:
get_local_db_sync_snapshot, RaindexSyncStatus, scheduler_state_label,
SchedulerState). Ensure scheduler_state_label covers the new Unknown case so the
health response no longer falsely reports "leader".
---
Duplicate comments:
In `@src/routes/health.rs`:
- Around line 86-93: In check_app_db and check_raindex_db update the error arms
so you do not forward the raw e.to_string() to clients: keep the existing
tracing::warn!(error = %e, ...) to log the full error, but change the
DbStatus.error field to a sanitized/generic message (e.g. "database error" or
the error kind) instead of e.to_string(); modify the error arms that construct
DbStatus (in check_app_db and check_raindex_db) to supply the generic string
while leaving the trace log unchanged.
In `@src/types/health.rs`:
- Around line 100-146: The OrderbookSyncInfo struct is missing the
latest_trade_age_seconds field required by the PR; add pub
latest_trade_age_seconds: Option<u64> (with #[serde(skip_serializing_if =
"Option::is_none")] and a schema/example annotation) to OrderbookSyncInfo, then
compute and populate that field in map_orderbook_snapshot (use the latest trade
timestamp to compute age in seconds or None if no trades), and finally update
the snapshot test (the test referencing map_orderbook_snapshot) to assert the
expected latest_trade_age_seconds value is present.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9a4b2d6f-a90a-46cc-bb7e-99fc2035f9f1
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (8)
lib/rain.orderbooksrc/main.rssrc/routes/health.rssrc/routes/order/mod.rssrc/routes/swap/mod.rssrc/routes/tokens.rssrc/test_helpers.rssrc/types/health.rs
✅ Files skipped from review due to trivial changes (3)
- src/routes/tokens.rs
- src/main.rs
- src/routes/order/mod.rs
🚧 Files skipped from review as they are similar to previous changes (3)
- lib/rain.orderbook
- src/routes/swap/mod.rs
- src/test_helpers.rs
b5b41f5 to
792a615
Compare
- docs/ops.md: journalctl + curl recipes for the deployed service, cache_warmer status interpretation, smoke-test usage - scripts/smoke.sh: end-to-end correctness + latency probe against the API, asserts on response shape (not just status), reports per-check latency - scripts/uptimerobot-setup.sh: one-shot creation of the 3 baseline UptimeRobot monitors (liveness, /health/detailed status=ok, cache_warmer running=true) Notes: - The smoke + UptimeRobot scripts reference /health/detailed.cache_warmer which only lights up once the detailed-health (#95) and cache-warmer PRs land. They will return FAIL/keyword-miss against current main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
792a615 to
dc10129
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/routes/health.rs`:
- Around line 87-103: In check_raindex_db, avoid holding the read lock on
SharedRaindexProvider across the await by extracting the client before the async
call: acquire the read guard, get the client via
shared_raindex.read().await.client(), clone or otherwise take ownership of the
RaindexClient (or copy necessary data) so the guard can be dropped (drop the
guard) before calling client.get_local_db_sync_snapshot().await, then call
map_raindex_snapshot on the snapshot; if RaindexClient isn’t Clone, restructure
to move the minimal owned data out of the guard prior to awaiting.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: fa504f47-6a55-4089-b548-48f9642c3b8e
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (8)
lib/rain.orderbooksrc/main.rssrc/routes/health.rssrc/routes/order/mod.rssrc/routes/swap/mod.rssrc/routes/tokens.rssrc/test_helpers.rssrc/types/health.rs
✅ Files skipped from review due to trivial changes (4)
- lib/rain.orderbook
- src/routes/order/mod.rs
- src/main.rs
- src/routes/tokens.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- src/routes/swap/mod.rs
- docs/ops.md: journalctl + curl recipes for the deployed service, cache_warmer status interpretation, smoke-test usage - scripts/smoke.sh: end-to-end correctness + latency probe against the API, asserts on response shape (not just status), reports per-check latency - scripts/uptimerobot-setup.sh: one-shot creation of the 3 baseline UptimeRobot monitors (liveness, /health/detailed status=ok, cache_warmer running=true) Notes: - The smoke + UptimeRobot scripts reference /health/detailed.cache_warmer which only lights up once the detailed-health (#95) and cache-warmer PRs land. They will return FAIL/keyword-miss against current main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dc10129 to
6e0d829
Compare

Motivation
GET /healthonly proves that the API process is up. It does not show whether raindex local DB sync is configured, actively syncing, ready for reads, or failing.The detailed endpoint gives monitoring and operators an on-demand snapshot without this API needing to know raindex local DB internals.
Solution
GET /health/detailedwhile keeping the existingGET /healthliveness response unchanged.lib/rain.orderbookto0f2f3d19212e6682c70e52a335d66988a10083a6and use the newget_local_db_sync_snapshot()API for raindex local DB sync status.raindex.client().get_local_db_sync_snapshot().awaitinstead of opening/querying SQLite directly.rusqlitedependency from this crate.active,syncing,failure, ornot_configured; top-level API health remainsok,degraded, orerror.Chained PRs
alastair/ops-tooling) is stacked on this PR and should be merged after this PR.Dependent PRs
lib/rain.orderbookto a commit that depends on that raindex change.Checks
nix develop -c cargo fmtnix develop -c cargo testnix develop -c rainix-rs-staticPart of the deployed-state split tracked in #94.
Summary by CodeRabbit