1 change: 1 addition & 0 deletions .gitignore
@@ -4,3 +4,4 @@ logs/
data/*
!data/.gitkeep
docs/book/
.claude/
38 changes: 34 additions & 4 deletions Cargo.lock

2 changes: 2 additions & 0 deletions Cargo.toml
@@ -34,6 +34,8 @@ rain-math-float = { path = "lib/rain.orderbook/lib/rain.interpreter/lib/rain.int
wasm-bindgen = "=0.2.100"
moka = { version = "0.12", features = ["future"] }
rusqlite = { version = "0.32" }
chrono = "0.4"
chrono-tz = "0.10"

[dev-dependencies]
tracing-test = "0.2"
16 changes: 16 additions & 0 deletions config/rest-api.toml
@@ -5,3 +5,19 @@ rate_limit_global_rpm = 600
rate_limit_per_key_rpm = 60
docs_dir = "/var/lib/st0x-docs"
local_db_path = "/mnt/data/st0x-rest-api/raindex.db"

# Replace the registry's single-URL `rpcs:` list with a pool of public Base
# RPCs. alloy's FallbackLayer (active_transport_count = 1, see
# `mk_read_provider`) health-routes to the best-scored transport and demotes
# any that 429 or error.
[rpc_override]
base = [
"https://mainnet.base.org",
"https://base.llamarpc.com",
"https://base.drpc.org",
"https://base-rpc.publicnode.com",
"https://base.meowrpc.com",
"https://base-mainnet.public.blastapi.io",
"https://base.gateway.tenderly.co",
"https://base.rpc.subquery.network/public",
]
92 changes: 92 additions & 0 deletions docs/ops.md
@@ -0,0 +1,92 @@
# Operations cheat sheet

Quick journalctl + curl recipes for the deployed `rest-api` service. SSH in with `nix develop -c remote` (or `ssh root@<host>` if your key is in `roles.ssh`).

## Service health

```bash
# Quick liveness probe (no auth)
curl -sS https://api.preview.st0x.io/health | jq

# Full status — includes db connectivity, raindex sync, cache_warmer
curl -sS https://api.preview.st0x.io/health/detailed | jq
```

Key fields in `/health/detailed.cache_warmer`:
- `running` — `false` until the warmer completes its first cycle (~15-30s after restart while caches are cold)
- `last_cycle_ms` — should track the steady-state cycle duration; sustained > 10s suggests upstream RPC slowness
- `seconds_since_last_complete` — should bounce between `0` and `~20` (cycle duration + REFRESH_INTERVAL); much higher means the warmer has frozen
- `last_errors` — per-token failures during the last cycle; non-zero is worth investigating
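
The field checks above can be scripted. The following is a minimal sketch, assuming the response nests these fields under `cache_warmer` exactly as named in the list. The helper name `check_warmer` and the 60-second staleness threshold are illustrative assumptions, and a single sample over 10s is flagged even though only a sustained run really matters. `last_errors` is left out because its exact type is not stated here.

```bash
# check_warmer: read a /health/detailed JSON body on stdin and print one line
# per problem found. Returns 0 when healthy, 1 on a problem, 2 if jq fails.
# Thresholds are assumptions: 10000 ms mirrors the "> 10s" note above, and
# 60 s stands in for "much higher than ~20".
check_warmer() {
  local problems
  problems=$(jq -r '
    .cache_warmer as $w
    | [ (if $w.running != true then "warmer not running" else empty end),
        (if ($w.last_cycle_ms // 0) > 10000
           then "slow cycle: \($w.last_cycle_ms)ms" else empty end),
        (if ($w.seconds_since_last_complete // 0) > 60
           then "warmer stale: \($w.seconds_since_last_complete)s" else empty end)
      ]
    | .[]
  ') || return 2
  if [[ -n "$problems" ]]; then
    printf '%s\n' "$problems"
    return 1
  fi
  return 0
}
```

Typical use would be `curl -sS https://api.preview.st0x.io/health/detailed | check_warmer`, which prints nothing and returns 0 when all three checks pass.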

## Common journalctl queries

All queries run via `ssh root@api.preview.st0x.io '...'` or after `nix develop -c remote`.

### 429 rate

```bash
# Count in the last hour
journalctl -u rest-api --since '1 hour ago' --no-pager | grep -c 'error code 429'

# Breakdown by error type (the grep counts error signatures, not RPC hosts)
journalctl -u rest-api --since '1 hour ago' --no-pager \
| grep -oE 'error code -32016|error code 429|StalePrice' \
| sort | uniq -c
```

### Cache warmer cycles

```bash
# Last 10 cycle durations + completion timestamps
journalctl -u rest-api --since '10 minutes ago' --no-pager \
| grep 'cache warmer: orders-by-token refresh complete' \
| sed -E 's/.*timestamp":"([^"]+)".*duration_ms":"?([0-9]+)"?.*/\1 cycle_ms=\2/' \
| tail -10
```

### ERROR-level rate

```bash
journalctl -u rest-api --since '5 minutes ago' --no-pager \
| grep -c 'level":"ERROR'
```

Most ERROR lines are benign (`No matching routes for HEAD /health` from external uptime checkers, or `task NNNN was cancelled` during graceful restart). Real signal:
- `failed to query orders` outside a deploy window
- `applied RPC override` should appear once on startup with the expected `url_count`
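
To count only the ERROR lines outside the benign categories above, the known-noise patterns can be filtered out first. This is a sketch, assuming the log lines contain the quoted phrases verbatim and that the id in `task NNNN was cancelled` is numeric. `count_real_errors` is a hypothetical helper name.

```bash
# count_real_errors: read log lines on stdin, keep ERROR-level lines, drop the
# two benign patterns named above, and print how many remain.
count_real_errors() {
  grep 'level":"ERROR' \
    | grep -vE 'No matching routes for HEAD /health|task [0-9]+ was cancelled' \
    | wc -l \
    | tr -d ' '   # some wc implementations pad the count with spaces
}
```

It can be fed from the same query as above: `journalctl -u rest-api --since '5 minutes ago' --no-pager | count_real_errors`.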

### Slow requests

```bash
# Requests > 5s in the last hour (raw rocket access logs)
journalctl -u rest-api --since '1 hour ago' --no-pager \
| grep 'request completed' \
| grep -oE 'duration_ms":[0-9]+(\.[0-9]+)?' \
| awk -F: '$2 > 5000 { print }' \
| wc -l
```

## Smoke tests

```bash
# Run the smoke battery against the live preview
API_KEY=<id> API_SECRET=<secret> ./scripts/smoke.sh

# Override target
API_URL=https://api.st0x.io API_KEY=... API_SECRET=... ./scripts/smoke.sh
```

The script returns non-zero on any FAIL. Run post-deploy or wire into a cron + alert. SLOW (over `LATENCY_BUDGET_MS=3000`) is reported as a warning, not a failure.
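
A post-deploy run can race the restart, since the service spends roughly 15-30s with cold caches. One option is to retry the script a few times before treating the deploy as failed. This is a sketch only: `retry` is a hypothetical helper, not part of `smoke.sh`, and the attempt count and delay are arbitrary.

```bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or ATTEMPTS is exhausted, sleeping
# DELAY_SECONDS between tries. Returns 0 on the first success, 1 otherwise.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # No point sleeping after the final attempt.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}
```

For example, `retry 3 20 ./scripts/smoke.sh` would give the service up to three attempts spaced 20 seconds apart.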

## Suggested cron / external monitoring

A minimal external probe (run from any machine that can reach the public hostname):

```bash
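# Run every 5 minutes; alert on non-zero exit (smoke.sh fails on any unexpected
# status, including 502/503). A crontab entry must be a single line: cron does
# not honor backslash continuations.
*/5 * * * * cd /path/to/st0x.rest.api && API_KEY=... API_SECRET=... ./scripts/smoke.sh > /tmp/smoke.last 2>&1 || alert-channel "smoke failed: $(tail -5 /tmp/smoke.last)"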
```

Higher-fidelity options (Prometheus + Grafana, Datadog, etc.) are deferred — the smoke + journalctl recipes cover most regressions for a single-instance preview.
8 changes: 8 additions & 0 deletions flake.nix
@@ -25,6 +25,14 @@

specialArgs = {
docsRoot = self.packages.x86_64-linux.st0x-docs;
# Public hostname this box answers on. Drives the nginx vhost
# name and the ACME cert. Defaults to `api.st0x.io` for prod;
# override with `SITE_HOSTNAME` env var for preview / staging
# deploys (e.g. `SITE_HOSTNAME=api.preview.st0x.io`). Requires
# `--impure` (already passed by the deploy wrappers).
siteHostname =
let env = builtins.getEnv "SITE_HOSTNAME";
in if env == "" then "api.st0x.io" else env;
};

modules =
9 changes: 7 additions & 2 deletions os.nix
@@ -1,4 +1,4 @@
-{ pkgs, lib, modulesPath, docsRoot, ... }:
+{ pkgs, lib, modulesPath, docsRoot, siteHostname, ... }:

let
inherit (import ./keys.nix) roles;
@@ -105,9 +105,14 @@ in {
# Rate-limit zone: 10 req/s per IP, burst 20
appendHttpConfig = ''
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

# UptimeRobot's keyword monitor operates on the raw response bytes
# without decompressing, so a gzip'd JSON response causes false
# "Keyword Not Found" alarms. Send uncompressed bodies to UR only.
gzip_disable "UptimeRobot";
'';

-virtualHosts."api.st0x.io" = {
+virtualHosts."${siteHostname}" = {
enableACME = true;
forceSSL = true;

145 changes: 145 additions & 0 deletions scripts/smoke.sh
@@ -0,0 +1,145 @@
#!/usr/bin/env bash
# smoke.sh — End-to-end correctness + latency smoke tests against a deployed
# st0x-rest-api instance. Designed to be run post-deploy or on a cron.
#
# Usage:
# API_URL=https://api.preview.st0x.io \
# API_KEY=<key-id> API_SECRET=<secret> \
# ./scripts/smoke.sh
#
# Exits 0 if all checks pass, non-zero otherwise. Prints a summary with
# per-check status + latency. Uses only curl + jq.

set -uo pipefail

API_URL="${API_URL:-https://api.preview.st0x.io}"
API_KEY="${API_KEY:-}"
API_SECRET="${API_SECRET:-}"

# Tokens to probe. Override via env if the registry changes.
USDC_BASE="${SMOKE_USDC:-0x833589fcd6edb6e08f4c7c32d4f71b54bda02913}"
SAMPLE_OWNER="${SMOKE_OWNER:-0x71b94911fd1ce621fc40970450004c544e5287a8}"

# Latency budget per endpoint, in ms. Checks that pass but exceed the budget
# are reported as warnings, not hard failures, so a flaky network doesn't sink
# CI; tune if real regressions slip through.
LATENCY_BUDGET_MS="${LATENCY_BUDGET_MS:-3000}"

PASS=0
FAIL=0
WARN=0

color() {
case "$1" in
green) printf '\033[32m%s\033[0m' "$2" ;;
red) printf '\033[31m%s\033[0m' "$2" ;;
yellow) printf '\033[33m%s\033[0m' "$2" ;;
*) printf '%s' "$2" ;;
esac
}

# probe NAME METHOD PATH EXPECTED_STATUS [JQ_FILTER]
# The optional JQ_FILTER is evaluated with `jq -e`, so its last output must be
# neither `false` nor `null` for the check to pass. Use it to assert on the
# response shape, not just the status code.
probe() {
local name="$1"
local method="$2"
local path="$3"
local expected_status="$4"
local jq_filter="${5:-}"
local auth_header=""
if [[ -n "$API_KEY" && -n "$API_SECRET" ]]; then
auth_header="-u $API_KEY:$API_SECRET"
fi

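local tmp
tmp=$(mktemp)
# curl's own error text stays on stderr; only the -w output is captured, so
# $result is always "<http_code> <time_total>" (000 when no response arrived).
local result
# shellcheck disable=SC2086
result=$(curl -sS -X "$method" $auth_header \
-o "$tmp" \
-w '%{http_code} %{time_total}\n' \
--max-time 30 \
"$API_URL$path") || true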

local status time_s
status=$(echo "$result" | awk '{print $1}')
time_s=$(echo "$result" | awk '{print $2}')
local time_ms
time_ms=$(awk -v t="$time_s" 'BEGIN { printf "%d", t * 1000 }')

local check_status="FAIL"
local detail=""

if [[ "$status" == "$expected_status" ]]; then
if [[ -n "$jq_filter" ]]; then
if jq -e "$jq_filter" >/dev/null 2>&1 < "$tmp"; then
check_status="PASS"
else
check_status="FAIL"
detail="(shape mismatch)"
fi
else
check_status="PASS"
fi
else
local body
body=$(head -c 200 "$tmp")
detail="(got $status, body: $body)"
fi

rm -f "$tmp"

local latency_marker=""
if [[ "$check_status" == "PASS" && "$time_ms" -gt "$LATENCY_BUDGET_MS" ]]; then
latency_marker=" $(color yellow SLOW)"
WARN=$((WARN + 1))
fi

case "$check_status" in
PASS)
printf ' [%s] %-50s %4dms%s\n' "$(color green PASS)" "$name" "$time_ms" "$latency_marker"
PASS=$((PASS + 1))
;;
*)
printf ' [%s] %-50s %4dms %s\n' "$(color red FAIL)" "$name" "$time_ms" "$detail"
FAIL=$((FAIL + 1))
;;
esac
}

echo "smoke tests against $API_URL"
echo " budget per check: ${LATENCY_BUDGET_MS}ms"
echo

# 1. Public endpoints (no auth)
probe "GET /health" GET "/health" 200 '.status == "ok"'
probe "GET /health/detailed" GET "/health/detailed" 200 '.status'
probe "GET /health/detailed has cache_warmer" GET "/health/detailed" 200 '.cache_warmer'

# 2. Protected endpoints reject missing auth
SAVED_KEY="$API_KEY"; SAVED_SECRET="$API_SECRET"
API_KEY="" API_SECRET=""
probe "GET /v1/tokens (no auth)" GET "/v1/tokens" 401
API_KEY="$SAVED_KEY"; API_SECRET="$SAVED_SECRET"

# 3. Authenticated endpoints — only run if creds are set
if [[ -n "$API_KEY" && -n "$API_SECRET" ]]; then
probe "GET /v1/tokens" GET "/v1/tokens" 200 '.tokens | type == "array"'
probe "GET /v1/orders/token/{usdc}" GET "/v1/orders/token/$USDC_BASE" 200 '(.orders | type == "array") and .pagination'
probe "GET /v1/orders/owner/{owner}" GET "/v1/orders/owner/$SAMPLE_OWNER" 200 '.orders | type == "array"'
probe "GET /v1/trades/token/{usdc}" GET "/v1/trades/token/$USDC_BASE?pageSize=10" 200 '.trades | type == "array"'
probe "GET /v1/trades/{owner}" GET "/v1/trades/$SAMPLE_OWNER?pageSize=10" 200 '.trades | type == "array"'
# Path validation only kicks in after auth succeeds — Rocket auth fairing
# runs first, so an invalid-address probe without auth would 401.
probe "GET /v1/orders/token/<bad>" GET "/v1/orders/token/not-an-address" 422
else
echo " (skipping authenticated checks; set API_KEY + API_SECRET to enable)"
fi

echo
echo "summary: $(color green "$PASS pass"), $(color red "$FAIL fail"), $(color yellow "$WARN slow")"

if [[ "$FAIL" -gt 0 ]]; then
exit 1
fi
exit 0