hotfix(jobs): xact-scoped advisory locks (unstick integrity-hash retry worker)#10

Merged
petterlindstrom79 merged 2 commits into main from claude/infallible-murdock-8d0bc1
Apr 17, 2026

Conversation

@petterlindstrom79

Summary

Production hotfix for the stuck-advisory-lock bug that crippled the Phase C
integrity-hash retry worker post-deploy (zero batches in 15 min, /v1/audit/:id
stuck at 202). The same latent bug was present in three sibling jobs.

  • 3 jobs → xact-scoped lock inside db.transaction(async (tx) => ...):
    integrity-hash-retry.ts, activation-drip.ts, db-retention.ts.
    Lock pins to the callback's connection, auto-releases at commit/rollback.
  • test-scheduler.ts → dedicated postgres client (max: 1) holding the
    session lock.
    A 5–10 min poll cycle can't sit inside one transaction (it would
    roll back all test_result writes on any failure and starve the pool).
    The dedicated client is not shared, so the pool-reuse gap cannot happen.
  • All four jobs now emit <job>-lock-busy logWarn on skip (structured,
    Better Stack visible) instead of plain console.log.

Root cause: pg_try_advisory_lock (session-scoped) on a pooled connection.
Drizzle's getDb() borrows connections per statement; the lock sat on
connection A while subsequent queries borrowed B; pg_advisory_unlock
silently no-op'd on the wrong connection. Lock stayed held forever.
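
That failure mode can be reproduced in miniature without a database. The sketch below is a hypothetical simulation (FakePool, tryAdvisoryLock, and advisoryUnlock are stand-ins, not the real driver or drizzle's API): session-scoped locks live on individual connections, so when the pool hands the unlock statement a different connection than the lock statement, the unlock is a silent no-op and the lock never releases.

```typescript
// Hypothetical stand-ins for a pooled driver; session-scoped advisory
// locks are tracked per connection, as Postgres does for real.
type Conn = { id: number; heldLocks: Set<number> };

class FakePool {
  private next = 0;
  constructor(private conns: Conn[]) {}
  // Per-statement checkout, round-robin: mimics getDb() borrowing a
  // fresh connection for each query.
  borrow(): Conn {
    return this.conns[this.next++ % this.conns.length];
  }
}

// pg_try_advisory_lock analogue: success pins the lock to THIS session.
function tryAdvisoryLock(c: Conn, key: number): boolean {
  if (c.heldLocks.has(key)) return false;
  c.heldLocks.add(key);
  return true;
}

// pg_advisory_unlock analogue: no-ops (returns false) when called on a
// session that never acquired the lock.
function advisoryUnlock(c: Conn, key: number): boolean {
  return c.heldLocks.delete(key);
}

const pool = new FakePool([
  { id: 0, heldLocks: new Set() },
  { id: 1, heldLocks: new Set() },
]);
const LOCK_KEY = 20260417;

const lockConn = pool.borrow();                        // lock statement → conn 0
const acquired = tryAdvisoryLock(lockConn, LOCK_KEY);  // true
const unlockConn = pool.borrow();                      // unlock statement → conn 1
const released = advisoryUnlock(unlockConn, LOCK_KEY); // false: wrong session

console.log({ acquired, released, stillHeld: lockConn.heldLocks.has(LOCK_KEY) });
// → { acquired: true, released: false, stillHeld: true }
```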

Test plan

  • npx tsc --noEmit clean
  • npx vitest run — 192 passed, 4 skipped (audit-token local-env
    failure is pre-existing; CI sets AUDIT_HMAC_SECRET at workflow level)
  • grep pg_advisory_unlock / pg_try_advisory_lock — only remaining
    call sites are test-scheduler's deliberate dedicated-connection
    pattern + existing mocks in integrity-hash-retry.test.ts
  • CI green
  • Post-deploy: integrity-hash-batch-done log line within 40s of Railway
    restart
  • Post-deploy: pg_locks shows no stuck advisory lock on 20260417,
    20260402, 20260415, 314159
  • Post-deploy: compliance_hash_state = 'pending' count drains on the
    30s cadence
  • Spot-check A: new /v1/do txn → /v1/audit/:id 202 → wait → 200 with
    hash populated
  • Spot-check B: SSRF refusal (F-0-007)

Deferred: deep lock-busy unit tests across all four jobs require the Phase D
Testcontainers harness (real Postgres). Mocking pg_try_advisory_xact_lock
inside a drizzle transaction is a fragile proxy for the real bug this PR
fixes. Follow-up filed.

🤖 Generated with Claude Code

petterlindstrom79 and others added 2 commits April 17, 2026 20:46
…ck lock

Four background jobs held session-scoped pg_try_advisory_lock through
drizzle's getDb() pool. On pool reuse, the lock sat on connection A while
subsequent queries borrowed connection B — and the final pg_advisory_unlock
silently no-op'd on the wrong connection. Locks stayed held indefinitely.
Phase C's integrity-hash retry worker surfaced the bug in prod: zero batches
processed for 15 minutes post-deploy, /v1/audit/:id stuck at 202, pg_locks
snapshot showed the lock held by an idle PID. The three sibling jobs had the
same latent bug — just not visible because nothing regularly probed them.

Three jobs moved to xact-scoped locks inside db.transaction(async (tx) => ...):
  - integrity-hash-retry.ts  (30s cadence, batch size 50)
  - activation-drip.ts       (6h cadence, N small updates)
  - db-retention.ts          (24h cadence, 5 DELETEs)

The xact-lock variant pins the lock to the same connection as every
statement in the callback and auto-releases at commit/rollback — the
pool-reuse gap cannot happen.
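
In the same miniature model (all names below are hypothetical stand-ins; the real jobs call drizzle's async db.transaction with sql`SELECT pg_try_advisory_xact_lock(${LOCK_ID})`), a transaction-scoped lock cannot leak, because the lock and every statement share the transaction's one connection and the release is unconditional at commit or rollback:

```typescript
// Miniature transaction: one connection for its whole lifetime; xact
// locks are recorded on it and dropped unconditionally in finally.
type Tx = { heldLocks: Set<number> };

function transaction<T>(fn: (tx: Tx) => T): T {
  const tx: Tx = { heldLocks: new Set() };
  try {
    return fn(tx);
  } finally {
    tx.heldLocks.clear(); // release always happens, even on throw/rollback
  }
}

// pg_try_advisory_xact_lock analogue: lock lives only as long as the tx.
function tryAdvisoryXactLock(tx: Tx, key: number): boolean {
  if (tx.heldLocks.has(key)) return false;
  tx.heldLocks.add(key);
  return true;
}

const LOCK_KEY = 20260402;
let batchesRun = 0;

function runOnce(): void {
  transaction((tx) => {
    if (!tryAdvisoryXactLock(tx, LOCK_KEY)) return; // contended: skip cycle
    batchesRun++; // batch statements share the lock's connection
  });
}

runOnce();
runOnce(); // the first run's lock cannot leak into the second
console.log(batchesRun); // → 2
```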

test-scheduler.ts diverged: a poll cycle runs 5–10 minutes iterating up to
20 capabilities with live external HTTP calls. Wrapping it in a transaction
would (a) hold one pooled connection for the entire cycle, starving the
live API, and (b) roll back every test_result write on any single failure,
poisoning the SQS window. Instead it uses a dedicated postgres client
(max: 1) scoped to the lock alone — all test work still flows through the
regular pool and commits independently. Pool reuse is impossible because
the lock's client is not shared.
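
The test-scheduler shape, again in miniature (Session, runPollCycle, and the lock key are hypothetical stand-ins for a postgres client opened with max: 1 plus the regular pool): one private session exists solely to hold the lock for the whole cycle, and because it is never returned to any pool, acquire and release are guaranteed to hit the same session.

```typescript
// Hypothetical dedicated session: never pooled, never shared, so the
// unlock always runs on the session that took the lock.
class Session {
  private held = new Set<number>();
  tryLock(key: number): boolean {
    if (this.held.has(key)) return false;
    this.held.add(key);
    return true;
  }
  unlock(key: number): boolean {
    return this.held.delete(key);
  }
}

const results: string[] = [];

function runPollCycle(lockSession: Session, capabilities: string[]): void {
  const LOCK_KEY = 314159;
  if (!lockSession.tryLock(LOCK_KEY)) {
    results.push("lock-busy-skip"); // another instance owns the cycle
    return;
  }
  try {
    // Each capability's writes go through the regular pool and commit
    // independently; one failure doesn't roll back the others.
    for (const cap of capabilities) {
      results.push(`tested:${cap}`);
    }
  } finally {
    lockSession.unlock(LOCK_KEY); // same session acquired and releases
  }
}

runPollCycle(new Session(), ["cap-a", "cap-b"]);
console.log(results); // → [ 'tested:cap-a', 'tested:cap-b' ]
```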

All four jobs now emit a structured `<job>-lock-busy` logWarn on skip
rather than plain console.log, so multi-instance contention shows up in
Better Stack.

Verification:
  - npx tsc --noEmit → clean
  - npx vitest run   → 192 passed, 4 skipped (audit-token local-env
    failure is pre-existing; CI sets AUDIT_HMAC_SECRET at workflow level)
  - grep pg_advisory_unlock / pg_try_advisory_lock → only remaining call
    sites are test-scheduler's deliberate dedicated-connection pattern and
    existing mocks in integrity-hash-retry.test.ts.

Deferred: deep lock-busy unit tests across all four jobs require the
Phase D Testcontainers harness (real Postgres) to exercise the actual
advisory-lock semantics — mocking pg_try_advisory_xact_lock inside a
drizzle transaction is a fragile proxy for the real bug this commit fixes.
Tracked as a follow-up; not a blocker for the production hotfix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ler lock helper

The F-0-009 no-bare-catch lint rule flagged the two .catch blocks in the
dedicated-connection advisory-lock helper. A failure there is harmless
(session ends implicitly, PG releases the lock), but the rule — rightly —
demands a structured log so operators see anomalies instead of them
vanishing silently. Swap both to `.catch((err) => logError(...))` with
the lockId in context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
petterlindstrom79 merged commit 5f0b6c1 into main Apr 17, 2026
1 check passed
petterlindstrom79 added a commit that referenced this pull request Apr 17, 2026
Close out Phase C with the verified post-hotfix state. Fills every PENDING
row in the original observations scaffold with real results:

  - Migrations: 0046 rate_limit_counters + 0047 compliance_hash_state both
    applied. 40,002 rows all 'complete', 0 pending, 0 failed. The other
    workflow's integrity_hash_status column is unchanged from the pre-deploy
    investigation baseline (39,510 / 287 / 150 / 55).
  - API boot: /health 200, every expected boot log line present, job
    startups visible at 18:55–18:57 UTC.
  - Log sink: Better Stack not configured on Railway; Pino → stdout →
    Railway logs. Shape correct. Bake monitoring runs via script, not
    dashboard rules — documented in BAKE_MONITORS.md.
  - Route gating: /v1/public/ops/* 200, /v1/internal/* 401. Admin wall
    active and public allowlist carries anonymous traffic.
  - Spot-check A (integrity hash E2E): PASS. Txn
    a2360b09-3978-49f7-afe8-160148ebcad4 went 202→200 within 37s,
    audit record hash populated, batch-done logs emitted on the designed
    cadence.
  - Spot-check B (SSRF): PASS. Both 169.254.169.254 (AWS metadata) and
    10.0.0.1 (RFC1918) refused, wallet balance unchanged.
  - Spot-check C (signup rate limit): intentionally SKIPPED — would burn
    the verifying IP's 1/day quota on live prod; covered by unit +
    integration tests + live monitor M5.

Also documents the post-deploy hotfix (PR #10 → commit 5f0b6c1) that
unstuck four silently-broken background jobs via xact-scoped advisory
locks (three jobs) plus a dedicated-connection pattern (test-scheduler).

Green-light call: GO at 2026-04-17T18:55:00Z. Bake clock ends
2026-04-19T18:55:00Z.

Includes:
  - PHASE_C_DEPLOY_OBSERVATIONS.md — finalized deploy narrative
  - BAKE_MONITORS.md — 6 monitors + cross-checks + checkpoint procedure
                       (script-based; no Better Stack access)
  - apps/api/scripts/verify-locks.mjs — pg_locks + pending-row snapshot
  - apps/api/scripts/verify-phase-c-state.mjs — migration + distribution check

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>