hotfix(jobs): xact-scoped advisory locks (unstick integrity-hash retry worker)#10

Merged
petterlindstrom79 merged 2 commits into main from claude/infallible-murdock-8d0bc1
Apr 17, 2026

Conversation

@petterlindstrom79

Summary

Production hotfix for the stuck-advisory-lock bug that crippled the Phase C
integrity-hash retry worker post-deploy (zero batches in 15 min, /v1/audit/:id
stuck at 202). The same latent bug was present in three sibling jobs.

  • 3 jobs → xact-scoped lock inside db.transaction(async (tx) => ...):
    integrity-hash-retry.ts, activation-drip.ts, db-retention.ts.
    Lock pins to the callback's connection, auto-releases at commit/rollback.
  • test-scheduler.ts → dedicated postgres client (max: 1) holding the
    session lock.
    A 5–10 min poll cycle can't sit inside one transaction (it would
    roll back all test_result writes on any failure and starve the pool).
    The dedicated client is not shared, so the pool-reuse gap cannot happen.
  • All four jobs now emit <job>-lock-busy logWarn on skip (structured,
    Better Stack visible) instead of plain console.log.

Root cause: pg_try_advisory_lock (session-scoped) on a pooled connection.
Drizzle's getDb() borrows connections per statement; the lock sat on
connection A while subsequent queries borrowed B; pg_advisory_unlock
silently no-op'd on the wrong connection. Lock stayed held forever.
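
That failure mode can be reproduced in miniature without a database. The sketch below is a hypothetical simulation (FakePool, tryAdvisoryLock, and advisoryUnlock are stand-ins, not the real driver or drizzle's API): session-scoped locks live on individual connections, so when the pool hands the unlock statement a different connection than the lock statement, the unlock is a silent no-op and the lock never releases.

```typescript
// Hypothetical stand-ins for a pooled driver; session-scoped advisory
// locks are tracked per connection, as Postgres does for real.
type Conn = { id: number; heldLocks: Set<number> };

class FakePool {
  private next = 0;
  constructor(private conns: Conn[]) {}
  // Per-statement checkout, round-robin: mimics getDb() borrowing a
  // fresh connection for each query.
  borrow(): Conn {
    return this.conns[this.next++ % this.conns.length];
  }
}

// pg_try_advisory_lock analogue: success pins the lock to THIS session.
function tryAdvisoryLock(c: Conn, key: number): boolean {
  if (c.heldLocks.has(key)) return false;
  c.heldLocks.add(key);
  return true;
}

// pg_advisory_unlock analogue: no-ops (returns false) when called on a
// session that never acquired the lock.
function advisoryUnlock(c: Conn, key: number): boolean {
  return c.heldLocks.delete(key);
}

const pool = new FakePool([
  { id: 0, heldLocks: new Set() },
  { id: 1, heldLocks: new Set() },
]);
const LOCK_KEY = 20260417;

const lockConn = pool.borrow();                        // lock statement → conn 0
const acquired = tryAdvisoryLock(lockConn, LOCK_KEY);  // true
const unlockConn = pool.borrow();                      // unlock statement → conn 1
const released = advisoryUnlock(unlockConn, LOCK_KEY); // false: wrong session

console.log({ acquired, released, stillHeld: lockConn.heldLocks.has(LOCK_KEY) });
// → { acquired: true, released: false, stillHeld: true }
```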

Test plan

  • npx tsc --noEmit clean
  • npx vitest run — 192 passed, 4 skipped (audit-token local-env
    failure is pre-existing; CI sets AUDIT_HMAC_SECRET at workflow level)
  • grep pg_advisory_unlock / pg_try_advisory_lock — only remaining
    call sites are test-scheduler's deliberate dedicated-connection
    pattern + existing mocks in integrity-hash-retry.test.ts
  • CI green
  • Post-deploy: integrity-hash-batch-done log line within 40s of Railway
    restart
  • Post-deploy: pg_locks shows no stuck advisory lock on 20260417,
    20260402, 20260415, 314159
  • Post-deploy: compliance_hash_state = 'pending' count drains on the
    30s cadence
  • Spot-check A: new /v1/do txn → /v1/audit/:id 202 → wait → 200 with
    hash populated
  • Spot-check B: SSRF refusal (F-0-007)

Deferred: deep lock-busy unit tests across all four jobs require the Phase D
Testcontainers harness (real Postgres). Mocking pg_try_advisory_xact_lock
inside a drizzle transaction is a fragile proxy for the real bug this PR
fixes. Follow-up filed.

🤖 Generated with Claude Code

petterlindstrom79 and others added 2 commits April 17, 2026 20:46
…ck lock

Four background jobs held session-scoped pg_try_advisory_lock through
drizzle's getDb() pool. On pool reuse, the lock sat on connection A while
subsequent queries borrowed connection B — and the final pg_advisory_unlock
silently no-op'd on the wrong connection. Locks stayed held indefinitely.
Phase C's integrity-hash retry worker surfaced the bug in prod: zero batches
processed for 15 minutes post-deploy, /v1/audit/:id stuck at 202, pg_locks
snapshot showed the lock held by an idle PID. The three sibling jobs had the
same latent bug — just not visible because nothing regularly probed them.

Three jobs moved to xact-scoped locks inside db.transaction(async (tx) => ...):
  - integrity-hash-retry.ts  (30s cadence, batch size 50)
  - activation-drip.ts       (6h cadence, N small updates)
  - db-retention.ts          (24h cadence, 5 DELETEs)

The xact-lock variant pins the lock to the same connection as every
statement in the callback and auto-releases at commit/rollback — the
pool-reuse gap cannot happen.
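
In the same miniature model (all names below are hypothetical stand-ins; the real jobs call drizzle's async db.transaction with sql`SELECT pg_try_advisory_xact_lock(${LOCK_ID})`), a transaction-scoped lock cannot leak, because the lock and every statement share the transaction's one connection and the release is unconditional at commit or rollback:

```typescript
// Miniature transaction: one connection for its whole lifetime; xact
// locks are recorded on it and dropped unconditionally in finally.
type Tx = { heldLocks: Set<number> };

function transaction<T>(fn: (tx: Tx) => T): T {
  const tx: Tx = { heldLocks: new Set() };
  try {
    return fn(tx);
  } finally {
    tx.heldLocks.clear(); // release always happens, even on throw/rollback
  }
}

// pg_try_advisory_xact_lock analogue: lock lives only as long as the tx.
function tryAdvisoryXactLock(tx: Tx, key: number): boolean {
  if (tx.heldLocks.has(key)) return false;
  tx.heldLocks.add(key);
  return true;
}

const LOCK_KEY = 20260402;
let batchesRun = 0;

function runOnce(): void {
  transaction((tx) => {
    if (!tryAdvisoryXactLock(tx, LOCK_KEY)) return; // contended: skip cycle
    batchesRun++; // batch statements share the lock's connection
  });
}

runOnce();
runOnce(); // the first run's lock cannot leak into the second
console.log(batchesRun); // → 2
```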

test-scheduler.ts diverged: a poll cycle runs 5–10 minutes iterating up to
20 capabilities with live external HTTP calls. Wrapping it in a transaction
would (a) hold one pooled connection for the entire cycle, starving the
live API, and (b) roll back every test_result write on any single failure,
poisoning the SQS window. Instead it uses a dedicated postgres client
(max: 1) scoped to the lock alone — all test work still flows through the
regular pool and commits independently. Pool reuse is impossible because
the lock's client is not shared.
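
The test-scheduler shape, again in miniature (Session, runPollCycle, and the lock key are hypothetical stand-ins for a postgres client opened with max: 1 plus the regular pool): one private session exists solely to hold the lock for the whole cycle, and because it is never returned to any pool, acquire and release are guaranteed to hit the same session.

```typescript
// Hypothetical dedicated session: never pooled, never shared, so the
// unlock always runs on the session that took the lock.
class Session {
  private held = new Set<number>();
  tryLock(key: number): boolean {
    if (this.held.has(key)) return false;
    this.held.add(key);
    return true;
  }
  unlock(key: number): boolean {
    return this.held.delete(key);
  }
}

const results: string[] = [];

function runPollCycle(lockSession: Session, capabilities: string[]): void {
  const LOCK_KEY = 314159;
  if (!lockSession.tryLock(LOCK_KEY)) {
    results.push("lock-busy-skip"); // another instance owns the cycle
    return;
  }
  try {
    // Each capability's writes go through the regular pool and commit
    // independently; one failure doesn't roll back the others.
    for (const cap of capabilities) {
      results.push(`tested:${cap}`);
    }
  } finally {
    lockSession.unlock(LOCK_KEY); // same session acquired and releases
  }
}

runPollCycle(new Session(), ["cap-a", "cap-b"]);
console.log(results); // → [ 'tested:cap-a', 'tested:cap-b' ]
```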

All four jobs now emit a structured `<job>-lock-busy` logWarn on skip
rather than plain console.log, so multi-instance contention shows up in
Better Stack.

Verification:
  - npx tsc --noEmit → clean
  - npx vitest run   → 192 passed, 4 skipped (audit-token local-env
    failure is pre-existing; CI sets AUDIT_HMAC_SECRET at workflow level)
  - grep pg_advisory_unlock / pg_try_advisory_lock → only remaining call
    sites are test-scheduler's deliberate dedicated-connection pattern and
    existing mocks in integrity-hash-retry.test.ts.

Deferred: deep lock-busy unit tests across all four jobs require the
Phase D Testcontainers harness (real Postgres) to exercise the actual
advisory-lock semantics — mocking pg_try_advisory_xact_lock inside a
drizzle transaction is a fragile proxy for the real bug this commit fixes.
Tracked as a follow-up; not a blocker for the production hotfix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ler lock helper

The F-0-009 no-bare-catch lint rule flagged the two .catch blocks in the
dedicated-connection advisory-lock helper. A failure there is harmless
(session ends implicitly, PG releases the lock), but the rule — rightly —
demands a structured log so operators see anomalies instead of them
vanishing silently. Swap both to `.catch((err) => logError(...))` with
the lockId in context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
petterlindstrom79 merged commit 5f0b6c1 into main Apr 17, 2026
1 check passed
petterlindstrom79 added a commit that referenced this pull request Apr 17, 2026
Close out Phase C with the verified post-hotfix state. Fills every PENDING
row in the original observations scaffold with real results:

  - Migrations: 0046 rate_limit_counters + 0047 compliance_hash_state both
    applied. 40,002 rows all 'complete', 0 pending, 0 failed. The other
    workflow's integrity_hash_status column is unchanged from the pre-deploy
    investigation baseline (39,510 / 287 / 150 / 55).
  - API boot: /health 200, every expected boot log line present, job
    startups visible at 18:55–18:57 UTC.
  - Log sink: Better Stack not configured on Railway; Pino → stdout →
    Railway logs. Shape correct. Bake monitoring runs via script, not
    dashboard rules — documented in BAKE_MONITORS.md.
  - Route gating: /v1/public/ops/* 200, /v1/internal/* 401. Admin wall
    active and public allowlist carries anonymous traffic.
  - Spot-check A (integrity hash E2E): PASS. Txn
    a2360b09-3978-49f7-afe8-160148ebcad4 went 202→200 within 37s,
    audit record hash populated, batch-done logs emitted on the designed
    cadence.
  - Spot-check B (SSRF): PASS. Both 169.254.169.254 (AWS metadata) and
    10.0.0.1 (RFC1918) refused, wallet balance unchanged.
  - Spot-check C (signup rate limit): intentionally SKIPPED — would burn
    the verifying IP's 1/day quota on live prod; covered by unit +
    integration tests + live monitor M5.

Also documents the post-deploy hotfix (PR #10 → commit 5f0b6c1) that
unstuck four silently-broken background jobs via xact-scoped advisory
locks (three jobs) plus a dedicated-connection pattern (test-scheduler).

Green-light call: GO at 2026-04-17T18:55:00Z. Bake clock ends
2026-04-19T18:55:00Z.

Includes:
  - PHASE_C_DEPLOY_OBSERVATIONS.md — finalized deploy narrative
  - BAKE_MONITORS.md — 6 monitors + cross-checks + checkpoint procedure
                       (script-based; no Better Stack access)
  - apps/api/scripts/verify-locks.mjs — pg_locks + pending-row snapshot
  - apps/api/scripts/verify-phase-c-state.mjs — migration + distribution check

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>