hotfix(jobs): xact-scoped advisory locks (unstick integrity-hash retry worker)#10
Merged
petterlindstrom79 merged 2 commits intomainfrom Apr 17, 2026
Merged
Conversation
…ck lock
Four background jobs held session-scoped pg_try_advisory_lock through
drizzle's getDb() pool. On pool reuse, the lock sat on connection A while
subsequent queries borrowed connection B — and the final pg_advisory_unlock
silently no-op'd on the wrong connection. Locks stayed held indefinitely.
Phase C's integrity-hash retry worker surfaced the bug in prod: zero batches
processed for 15 minutes post-deploy, /v1/audit/:id stuck at 202, pg_locks
snapshot showed the lock held by an idle PID. The three sibling jobs had the
same latent bug — just not visible because nothing regularly probed them.
Three jobs moved to xact-scoped locks inside db.transaction(async (tx) => ...):
- integrity-hash-retry.ts (30s cadence, batch size 50)
- activation-drip.ts (6h cadence, N small updates)
- db-retention.ts (24h cadence, 5 DELETEs)
The xact-lock variant pins the lock to the same connection as every
statement in the callback and auto-releases at commit/rollback — the
pool-reuse gap cannot happen.
test-scheduler.ts diverged: a poll cycle runs 5–10 minutes iterating up to
20 capabilities with live external HTTP calls. Wrapping it in a transaction
would (a) hold one pooled connection for the entire cycle, starving the
live API, and (b) rollback every test_result write on any single failure,
poisoning the SQS window. Instead it uses a dedicated postgres client
(max: 1) scoped to the lock alone — all test work still flows through the
regular pool and commits independently. Pool reuse is impossible because
the lock's client is not shared.
All four jobs now emit a structured `<job>-lock-busy` logWarn on skip
rather than plain console.log, so multi-instance contention shows up in
Better Stack.
Verification:
- npx tsc --noEmit → clean
- npx vitest run → 192 passed, 4 skipped (audit-token local-env
failure is pre-existing; CI sets AUDIT_HMAC_SECRET at workflow level)
- grep pg_advisory_unlock / pg_try_advisory_lock → only remaining call
sites are test-scheduler's deliberate dedicated-connection pattern and
existing mocks in integrity-hash-retry.test.ts.
Deferred: deep lock-busy unit tests across all four jobs require the
Phase D Testcontainers harness (real Postgres) to exercise the actual
advisory-lock semantics — mocking pg_try_advisory_xact_lock inside a
drizzle transaction is a fragile proxy for the real bug this commit fixes.
Tracked as a follow-up; not a blocker for the production hotfix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ler lock helper The F-0-009 no-bare-catch lint rule flagged the two .catch blocks in the dedicated-connection advisory-lock helper. A failure there is harmless (session ends implicitly, PG releases the lock), but the rule — rightly — demands a structured log so operators see anomalies instead of them vanishing silently. Swap both to `.catch((err) => logError(...))` with the lockId in context. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
petterlindstrom79
added a commit
that referenced
this pull request
Apr 17, 2026
Close out Phase C with the verified post-hotfix state. Fills every PENDING
row in the original observations scaffold with real results:
- Migrations: 0046 rate_limit_counters + 0047 compliance_hash_state both
applied. 40,002 rows all 'complete', 0 pending, 0 failed. The other
workflow's integrity_hash_status column is unchanged from the pre-deploy
investigation baseline (39,510 / 287 / 150 / 55).
- API boot: /health 200, every expected boot log line present, job
startups visible at 18:55–18:57 UTC.
- Log sink: Better Stack not configured on Railway; Pino → stdout →
Railway logs. Shape correct. Bake monitoring runs via script, not
dashboard rules — documented in BAKE_MONITORS.md.
- Route gating: /v1/public/ops/* 200, /v1/internal/* 401. Admin wall
active and public allowlist carries anonymous traffic.
- Spot-check A (integrity hash E2E): PASS. Txn
a2360b09-3978-49f7-afe8-160148ebcad4 went 202→200 within 37s,
audit record hash populated, batch-done logs emitted on the designed
cadence.
- Spot-check B (SSRF): PASS. Both 169.254.169.254 (AWS metadata) and
10.0.0.1 (RFC1918) refused, wallet balance unchanged.
- Spot-check C (signup rate limit): intentionally SKIPPED — would burn
the verifying IP's 1/day quota on live prod; covered by unit +
integration tests + live monitor M5.
Also documents the post-deploy hotfix (PR #10 → commit 5f0b6c1) that
unstuck four silently-broken background jobs via xact-scoped advisory
locks (three jobs) plus a dedicated-connection pattern (test-scheduler).
Green-light call: GO at 2026-04-17T18:55:00Z. Bake clock ends
2026-04-19T18:55:00Z.
Includes:
- PHASE_C_DEPLOY_OBSERVATIONS.md — finalized deploy narrative
- BAKE_MONITORS.md — 6 monitors + cross-checks + checkpoint procedure
(script-based; no Better Stack access)
- apps/api/scripts/verify-locks.mjs — pg_locks + pending-row snapshot
- apps/api/scripts/verify-phase-c-state.mjs — migration + distribution check
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Production hotfix for the stuck-advisory-lock bug that crippled the Phase C
integrity-hash retry worker post-deploy (zero batches in 15 min,
/v1/audit/:idstuck at 202). The same latent bug was present in three sibling jobs.
db.transaction(async (tx) => ...):integrity-hash-retry.ts,activation-drip.ts,db-retention.ts.Lock pins to the callback's connection, auto-releases at commit/rollback.
test-scheduler.ts→ dedicatedpostgresclient (max: 1) holding thesession lock. A 5–10 min poll cycle can't sit inside one transaction
(would rollback all
test_resultwrites on any failure, starve the pool).The dedicated client is not shared, so the pool-reuse gap cannot happen.
<job>-lock-busylogWarnon skip (structured,Better Stack visible) instead of plain
console.log.Root cause:
pg_try_advisory_lock(session-scoped) on a pooled connection.Drizzle's
getDb()borrows connections per statement; the lock sat onconnection A while subsequent queries borrowed B;
pg_advisory_unlocksilently no-op'd on the wrong connection. Lock stayed held forever.
Test plan
npx tsc --noEmitcleannpx vitest run— 192 passed, 4 skipped (audit-token local-envfailure is pre-existing; CI sets
AUDIT_HMAC_SECRETat workflow level)grep pg_advisory_unlock / pg_try_advisory_lock— only remainingcall sites are
test-scheduler's deliberate dedicated-connectionpattern + existing mocks in
integrity-hash-retry.test.tsintegrity-hash-batch-donelog line within 40s of Railwayrestart
pg_locksshows no stuck advisory lock on 20260417,20260402, 20260415, 314159
compliance_hash_state = 'pending'count drains on the30s cadence
/v1/dotxn →/v1/audit/:id202 → wait → 200 withhash populated
Deferred: deep lock-busy unit tests across all four jobs require the Phase D
Testcontainers harness (real Postgres). Mocking
pg_try_advisory_xact_lockinside a drizzle transaction is a fragile proxy for the real bug this PR
fixes. Follow-up filed.
🤖 Generated with Claude Code