v0.42.50.0 fix(autopilot): kill the dead-job storm + supervisor queue wedge (#2194 #2227 #1994) by garrytan · Pull Request #2249 · garrytan/gbrain

garrytan · 2026-06-17T13:50:33Z

Summary

Fixes the recurring failure mode on multi-source Postgres brains where autopilot manufactured a continuous stream of dead autopilot-cycle jobs and the minion supervisor periodically wedged the queue it is meant to keep alive. Root cause was one disease with several interacting parts; this wave addresses all of them. Closes gbrain#2194, gbrain#2227, gbrain#1994.

What changed (8 atomic commits)

test: pin that the duplicate-supervisor LOCK_HELD fence-exit is never counted as a crash (#2227).
fix (prerequisite): per-source cycles bind filesystem phases to source.local_path, not the global repo path, so FS/DB phases and the freshness stamp agree on the source (#2194/#2227, TODOS:634).
fix: gbrain jobs supervisor status / gbrain doctor detect a live supervisor via the queue-scoped DB lock when the $HOME-derived pidfile is absent. Keyed on lock freshness, never a bare PID probe (PID-reuse-safe) (#2227).
fix: supervisor degrades to capped-backoff retry instead of permanently giving up at the soft crash budget; self-heals on a transient DB blip; hard ceiling via GBRAIN_SUPERVISOR_HARD_STOP_CRASHES (#1994/#2227, TODOS:92).
feat: clamp per-tick fan-out to max(1, concurrency−1) (gated on a live supervisor) + a gbrain doctor warning on mismatch (#2194).
feat: per-source failure cooldown (bounded exponential 10→120min), checked at dispatch AND claim time; a success clears it (#2194).
feat: split the cycle. Per-source jobs run only source-scoped phases; a single autopilot-global-maintenance job runs the brain-wide phases once per window (structural single-flight). This kills the 4→10GB RSS blowout that was the shared root cause of #2194 and #2227 (#2194/#2227).
fix: categorize the new doctor check + a nullable-engine guard.

Why the split (the load-bearing decision)

The plan first proposed single-flight-skip (concurrent cycles skip the global phases if another holds a lock). A codex outside-voice review during /plan-eng-review caught that skip-then-stamp-fresh poisons the only freshness gate — a source gets marked done while its new rows go unembedded. The split design (per-source phases + one global-maintenance job) is the codebase's documented Phase-2 design: no skip semantics, no freshness poisoning, single-flight by construction.

New config / surface

Jobs: autopilot-global-maintenance
Config: autopilot.failure_cooldown_min (0=off), autopilot.failure_cooldown_cap_min, autopilot.fanout_clamp_to_concurrency, autopilot.global_floor_min
Env: GBRAIN_SUPERVISOR_HARD_STOP_CRASHES
Doctor check: autopilot_fanout_concurrency
Source config field: last_source_cycle_at; brain config: autopilot.last_global_at

All on by default; existing brains pick them up on the next tick with no migration (one catch-up global pass on first tick).

Test plan

bun run typecheck — clean.
New/changed unit tests across the blast radius all pass (supervisor, child-worker-supervisor, db-lock, fanout, cooldown, global-maintenance, doctor, doctor-categories) — 438 tests green post-merge.
Engine-parity: new test/e2e/autopilot-cooldown-parity.test.ts (PGLite half local; Postgres half runs in CI).
Reviewed via /plan-eng-review + codex outside-voice (logged CLEAR).

🤖 Generated with Claude Code

#2227) A duplicate supervisor loses the queue-scoped DB singleton lock (#1849) and exits LOCK_HELD before spawning a worker or emitting 'started'. summarizeCrashes counts only worker_exited, so the fence path is structurally uncountable. Pin it so a future refactor that logs worker_exited on the fence path fails here instead of silently re-introducing the crash-budget breaker-trip loop. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
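The invariant this test pins can be illustrated with a minimal sketch. The event shape and the `countCrashesSketch` helper below are hypothetical stand-ins, not the repo's actual `summarizeCrashes` or audit schema; the only point is that a fence exit produces no `worker_exited` event, so it cannot contribute to the crash count.

```typescript
// Hypothetical event shape; the real audit-event schema is not shown in this PR.
type SupervisorEventSketch =
  | { kind: "started" }
  | { kind: "worker_exited"; code: number }
  | { kind: "lock_held_exit" }; // assumed name for the duplicate-supervisor fence exit

// Illustrative stand-in for summarizeCrashes: only worker_exited events count.
function countCrashesSketch(events: SupervisorEventSketch[]): number {
  return events.filter((e) => e.kind === "worker_exited").length;
}

// A duplicate supervisor loses the lock and exits before spawning a worker.
const fenceExitOnly: SupervisorEventSketch[] = [{ kind: "lock_held_exit" }];

// A supervisor whose worker really died twice.
const realCrashes: SupervisorEventSketch[] = [
  { kind: "started" },
  { kind: "worker_exited", code: 1 },
  { kind: "worker_exited", code: 137 },
];

const fenceCrashCount = countCrashesSketch(fenceExitOnly);
const realCrashCount = countCrashesSketch(realCrashes);
```

If a later refactor started logging `worker_exited` on the fence path, a pin of this shape would begin to fail, which is the behavior the commit asks for.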

…, not global repo (#2194 #2227) A per-source autopilot-cycle inherited the global sync.repo_path as brainDir while stamping DB freshness for source_id — mixed scope. FS phases (sync/lint/extract) ran against the wrong tree, so the failure-cooldown and freshness gates would attribute work to the wrong source. Resolve the source's local_path in the handler (reuse the archive-recheck SELECT) and bind brainDir to it; a pure-DB source gets null (FS phases skip) instead of falling through to the global checkout. Legacy no-source dispatch keeps the global repoPath. Prerequisite for the cooldown/split commits (codex outside-voice #8). Resolves TODOS:634. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
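The three cases described above (per-source with a checkout, pure-DB source, legacy no-source dispatch) might be resolved roughly as follows. `SourceRowSketch` and `resolveBrainDirSketch` are hypothetical names for illustration; the real handler reuses the archive-recheck SELECT, which is not reproduced here.

```typescript
// Hypothetical row shape: only the fields this sketch needs.
interface SourceRowSketch {
  id: string;
  local_path: string | null; // null for a pure-DB source (assumption about representation)
}

// Returns the directory FS phases should run against, or null to skip them.
function resolveBrainDirSketch(
  source: SourceRowSketch | null,
  globalRepoPath: string,
): string | null {
  // Legacy no-source dispatch keeps the global repo path.
  if (source === null) return globalRepoPath;
  // Per-source dispatch binds to the source's own tree; a pure-DB source
  // yields null so FS phases skip instead of falling through to the global checkout.
  return source.local_path;
}

const withCheckout = resolveBrainDirSketch({ id: "notes", local_path: "/data/notes" }, "/data/brain");
const pureDb = resolveBrainDirSketch({ id: "api-feed", local_path: null }, "/data/brain");
const legacy = resolveBrainDirSketch(null, "/data/brain");
```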

… $HOME (#2227) jobs supervisor status + doctor read the HOME-derived pidfile, so a supervisor started under a different $HOME (keeper=/root vs ops=/data) read as 'not running' while healthy — the false signal that drives an operator to spawn a duplicate. Both surfaces now fall back to the queue-scoped DB singleton lock (#1849), the HOME-independent authority, when the pidfile shows nothing. New isLockHolderLive keys on lock freshness (ttl + heartbeat steal-grace), never process.kill, so PID reuse can't false-positive (pid-liveness-alone-pid-reuse). Status surfaces the holder host/pid + recorded concurrency/max-rss from the latest started event. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
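A freshness-keyed liveness check of the kind described could look like the sketch below. The lock row fields and the exact comparison (heartbeat age against ttl plus steal-grace) are assumptions for illustration, not the actual `isLockHolderLive` in db-lock.ts. The property worth noticing is that the recorded PID is never probed, so a reused PID cannot make a dead holder look alive.

```typescript
// Hypothetical lock row; field names are illustrative.
interface LockRowSketch {
  holderPid: number;     // recorded for display only, never probed
  heartbeatAtMs: number; // last heartbeat written by the holder
  ttlMs: number;         // lock time-to-live
}

// A holder counts as live only while its heartbeat is fresh.
function isLockFreshSketch(
  lock: LockRowSketch | null,
  nowMs: number,
  stealGraceMs: number,
): boolean {
  if (lock === null) return false; // no lock row means no holder
  const ageMs = nowMs - lock.heartbeatAtMs;
  return ageMs <= lock.ttlMs + stealGraceMs;
}

const now = 100_000;
const freshHolder = isLockFreshSketch({ holderPid: 4242, heartbeatAtMs: 95_000, ttlMs: 10_000 }, now, 5_000);
const staleHolder = isLockFreshSketch({ holderPid: 4242, heartbeatAtMs: 50_000, ttlMs: 10_000 }, now, 5_000);
const noHolder = isLockFreshSketch(null, now, 5_000);
```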

… storm (#1994 #2227) max_crashes_exceeded gave up forever, so a transient DB-pooler blip that tripped the soft budget wedged the queue until a human restart (#2227's breaker-trips tail). Crossing the soft budget now enters degraded mode: keep respawning with capped exponential backoff (60s cap — a paced retry, not a hot loop) and emit a loud crash_budget_degraded health_warn. The existing stable-run reset clears the count once a respawn survives >5min, so a recovered DB self-heals. Permanent give-up fires only at a much-higher hard ceiling (maxCrashes × 10), tunable/disablable via GBRAIN_SUPERVISOR_HARD_STOP_CRASHES (0 = never). Resolves TODOS:92. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
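The degraded-retry policy can be sketched as a pure decision function. Everything here is illustrative: `decideAfterCrashSketch`, the 1s base delay, and the use of `>=` at both thresholds are assumptions. What comes from the commit message is the 60s backoff cap, the default hard ceiling of maxCrashes × 10, and that a ceiling of 0 means never give up.

```typescript
type SupervisorDecisionSketch =
  | { action: "respawn"; delayMs: number; degraded: boolean }
  | { action: "give_up" };

const BACKOFF_CAP_MS = 60_000; // 60s cap from the commit message

// Default hard ceiling when no override is supplied.
function defaultHardStopSketch(maxCrashes: number): number {
  return maxCrashes * 10;
}

function decideAfterCrashSketch(
  crashCount: number,
  maxCrashes: number,
  hardStopCrashes: number, // 0 disables the permanent give-up
  baseDelayMs: number = 1_000, // assumed base delay, not stated in the source
): SupervisorDecisionSketch {
  if (hardStopCrashes > 0 && crashCount >= hardStopCrashes) {
    return { action: "give_up" };
  }
  // Bound the exponent so the intermediate value stays finite.
  const exponent = Math.min(crashCount, 30);
  const delayMs = Math.min(baseDelayMs * 2 ** exponent, BACKOFF_CAP_MS);
  // At or past the soft budget the supervisor keeps respawning, flagged degraded.
  return { action: "respawn", delayMs, degraded: crashCount >= maxCrashes };
}

const early = decideAfterCrashSketch(1, 5, defaultHardStopSketch(5));
const atSoftBudget = decideAfterCrashSketch(5, 5, 50);
const atHardCeiling = decideAfterCrashSketch(50, 5, 50);
const ceilingDisabled = decideAfterCrashSketch(500, 5, 0);
```

The stable-run reset (clearing the count once a respawn survives more than 5 minutes) sits outside this function; it would simply feed a lower `crashCount` into the next decision.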

…#2194) Fan-out resolved to 4 (Postgres) regardless of worker --concurrency, so surplus cycles queued behind the worker and raced the stalled-sweeper. Two fixes for the same mismatch:
- resolveEffectiveFanoutMax clamps to max(1, concurrency-1) (reserve a slot), gated on a LIVE DB-lock holder so a stale started-audit row can't shrink throughput (codex #9/D5); no live holder → unknown → unclamped base. Escape hatch autopilot.fanout_clamp_to_concurrency.
- doctor's autopilot_fanout_concurrency check warns when fan-out exceeds effective slots; the misconfig was silent before. Advisory (started-event concurrency), wired into both doctor surfaces.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
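The clamp rule reduces to a few lines of arithmetic. The sketch below is not the repo's `resolveEffectiveFanoutMax`; `effectiveFanoutSketch` and its parameters are hypothetical, and it assumes "clamp" means taking the smaller of the base fan-out and max(1, concurrency−1). A `null` concurrency stands for "no live DB-lock holder, so concurrency is unknown".

```typescript
function effectiveFanoutSketch(
  baseFanout: number,             // e.g. 4 on Postgres, per the commit message
  liveConcurrency: number | null, // null when no live supervisor is detected
  clampEnabled: boolean = true,   // mirrors autopilot.fanout_clamp_to_concurrency
): number {
  // Escape hatch off, or concurrency unknown: leave the base untouched.
  if (!clampEnabled || liveConcurrency === null) return baseFanout;
  // Reserve one worker slot, but never drop below a single cycle per tick.
  const slots = Math.max(1, liveConcurrency - 1);
  return Math.min(baseFanout, slots);
}

const twoWorkers = effectiveFanoutSketch(4, 2);        // 1
const oneWorker = effectiveFanoutSketch(4, 1);         // 1
const manyWorkers = effectiveFanoutSketch(4, 8);       // 4
const unknown = effectiveFanoutSketch(4, null);        // 4
const clampDisabled = effectiveFanoutSketch(4, 2, false); // 4
```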

…rm (#2194) Only SUCCESS gated dispatch, so a source whose cycle kept failing/timing-out re-fanned-out every 5-min tick forever (200+ dead jobs/24h). Now a failed source backs off with bounded exponential cooldown (10→120min). Read at DISPATCH from minion_jobs dead/failed rows (timeouts/RSS-kills dead-letter via SQL and never run handler code, so a write-only hook would miss them) AND re-checked at CLAIM time in the handler (codex #5: already-queued/retrying jobs). A success clears it (codex #7); null-source rows excluded (codex #6); engine-parity via executeRaw. Disable with autopilot.failure_cooldown_min=0. Fail-open if config/history reads error. Surfaced via fanout_cooldown_skipped + the fanout summary. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
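A bounded exponential cooldown with the stated 10→120min range might be computed as below. The doubling factor and the helper names are assumptions; the source gives only the bounds, that a base of 0 disables the feature, and that a success clears the cooldown (modelled here as a failure count of 0).

```typescript
// Cooldown in minutes after N consecutive failures.
function failureCooldownMinSketch(
  consecutiveFailures: number,
  baseMin: number = 10, // mirrors autopilot.failure_cooldown_min (0 = off)
  capMin: number = 120, // mirrors autopilot.failure_cooldown_cap_min
): number {
  if (baseMin <= 0 || consecutiveFailures <= 0) return 0;
  // Assumed growth: double per additional failure, bounded by the cap.
  return Math.min(baseMin * 2 ** (consecutiveFailures - 1), capMin);
}

// The same check would run at dispatch and again at claim time.
function canDispatchSketch(
  lastFailureAtMs: number,
  consecutiveFailures: number,
  nowMs: number,
): boolean {
  const cooldownMs = failureCooldownMinSketch(consecutiveFailures) * 60_000;
  return nowMs - lastFailureAtMs >= cooldownMs;
}

const schedule = [1, 2, 3, 4, 5, 6].map((n) => failureCooldownMinSketch(n));
const stillCooling = canDispatchSketch(0, 2, 10 * 60_000);
const cooledDown = canDispatchSketch(0, 2, 20 * 60_000);
```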

…ntenance job (#2194 #2227) N per-source cycles each ran the brain-wide global phases (embed-all/orphans/ purge/…) concurrently, thrashing the same rows and taking the worker 4→10GB in <60s → RSS-kill → orphaned stalls. Split them: per-source jobs now run only source-scoped (+ mixed) phases and stamp last_source_cycle_at; a new autopilot-global-maintenance job runs the global phases ONCE per window (idempotency_key + maxWaiting:1 = structural single-flight) and stamps autopilot.last_global_at. This is the codex-endorsed design that replaced the rejected skip-and-stamp-fresh approach (codex #1/#2): no freshness poisoning, no starvation — global work always runs as its own job, never marked done when it wasn't. PHASE_SCOPE is now a runtime partition (GLOBAL ∪ NON_GLOBAL == ALL). last_full_cycle_at still written for doctor/legacy (no longer a global gate). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
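The partition invariant (GLOBAL ∪ NON_GLOBAL == ALL, with no overlap) is easy to check at runtime. The sketch below uses only the phase names mentioned in these commit messages; the real phase list is longer (the message elides it with "…"), and the scope assigned to each phase here is an illustrative assumption rather than the contents of the actual `PHASE_SCOPE`.

```typescript
type PhaseScopeSketch = "source" | "mixed" | "global";

// Illustrative subset only; scope assignments are assumptions.
const PHASE_SCOPE_SKETCH: Record<string, PhaseScopeSketch> = {
  sync: "source",
  lint: "source",
  extract: "source",
  "embed-all": "global",
  orphans: "global",
  purge: "global",
};

const ALL_PHASES = Object.keys(PHASE_SCOPE_SKETCH);
// Global phases run once per window in the global-maintenance job.
const GLOBAL_PHASES = ALL_PHASES.filter((p) => PHASE_SCOPE_SKETCH[p] === "global");
// Source-scoped and mixed phases run in each per-source job.
const NON_GLOBAL_PHASES = ALL_PHASES.filter((p) => PHASE_SCOPE_SKETCH[p] !== "global");

// True when the two sets are disjoint and together cover every phase.
function isPartitionSketch(a: string[], b: string[], all: string[]): boolean {
  const overlap = a.some((p) => b.includes(p));
  const union = new Set([...a, ...b]);
  return !overlap && union.size === all.length && all.every((p) => union.has(p));
}

const partitionHolds = isPartitionSketch(GLOBAL_PHASES, NON_GLOBAL_PHASES, ALL_PHASES);
```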

) Follow-up to the supervisor-visibility commit: doctor's engine binding is BrainEngine | null, so the inspectLock fallback must guard on a non-null engine (tsc TS2345). No behavior change — a null engine simply skips the DB-lock probe and falls back to the pidfile reading, as before. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

) Follow-up to the fan-out/concurrency commit: the doctor-categories drift guard requires every check name in doctor.ts to belong to exactly one category set. Add the new autopilot_fanout_concurrency check to OPS_CHECK_NAMES (infrastructure liveness, alongside wedged_queue/supervisor). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…-wedge-fix

… wedge (#2194 #2227 #1994) Split the per-source cycle (source phases fan out; one global-maintenance job runs the brain-wide phases once) to end the 4→10GB RSS blowout; add per-source failure cooldown, fan-out-to-concurrency clamp + doctor warning, supervisor degraded-retry instead of permanent give-up, and DB-lock supervisor visibility under split $HOME. Resolves TODOS:634 (repoPath) and TODOS:92 (#1994 backoff). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…raded-retry (#2194 #2227) Post-ship document-release: refresh the KEY_FILES current-state entries that drifted — cycle.ts (GLOBAL/NON_GLOBAL phase split + last_source_cycle_at / autopilot.last_global_at), jobs.ts (per-source local_path brainDir, claim-time cooldown, autopilot-global-maintenance handler), supervisor.ts + child-worker (degraded retry instead of permanent give-up; hard ceiling), db-lock.ts (isLockHolderLive), handler-timeouts (new handler). Regenerated llms bundle. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…-wedge-fix # Conflicts: # CHANGELOG.md # VERSION # package.json

garrytan · 2026-06-18T18:05:20Z

Superseded by #2287. This branch's 8 functional commits + the KEY_FILES doc update were replayed commit-for-commit onto current master in #2287 (re-versioned v0.42.50.0 → v0.42.52.0, since 0.42.50.0/0.42.51.0 were taken by #2254/#2255), bundled with four adjacent reliability fixes (#1737 #1738 #1950 #1984). #2287 closes #2194/#2227/#1994. Closing this in favor of #2287.

…2194) The cherry-picked autopilot-fanout-clamp + doctor-autopilot-fanout-concurrency tests mutate process.env.GBRAIN_AUDIT_DIR in beforeEach/afterEach, which the check:test-isolation R1 lint flags (parallel shards load multiple files per process). Rename to *.serial.test.ts (sanctioned quarantine — they run under --max-concurrency=1) instead of restructuring the reviewed test bodies. No logic change; both files stay green (9 tests). Fixes the failing verify CI check.

garrytan and others added 13 commits June 16, 2026 22:36

Merge remote-tracking branch 'origin/master' into garrytan/supervisor…

64a4c72

…-wedge-fix

Merge remote-tracking branch 'origin/master' into garrytan/supervisor…

94f7c73

…-wedge-fix # Conflicts: # CHANGELOG.md # VERSION # package.json

garrytan mentioned this pull request Jun 18, 2026

v0.42.52.0 fix(reliability): autopilot dead-job storm + supervisor wedge + sync/status/minion reliability (#2194 #2227 #1994 #1737 #1738 #1950 #1984) #2287

Open

garrytan closed this Jun 18, 2026

garrytan wants to merge 13 commits into master from garrytan/supervisor-wedge-fix
