v0.42.50.0 fix(autopilot): kill the dead-job storm + supervisor queue wedge (#2194 #2227 #1994)#2249

Closed
garrytan wants to merge 13 commits into master from garrytan/supervisor-wedge-fix

@garrytan

Owner

Summary

Fixes the recurring failure mode on multi-source Postgres brains where autopilot manufactured a continuous stream of dead autopilot-cycle jobs and the minion supervisor periodically wedged the queue it is meant to keep alive. Root cause was one disease with several interacting parts; this wave addresses all of them. Closes gbrain#2194, gbrain#2227, gbrain#1994.

What changed (8 atomic commits)

  1. test: pin that the duplicate-supervisor LOCK_HELD fence-exit is never counted as a crash (#2227).
  2. fix (prerequisite): per-source cycles bind filesystem phases to source.local_path, not the global repo path, so FS/DB phases and the freshness stamp agree on the source (#2194/#2227, TODOS:634).
  3. fix: gbrain jobs supervisor status / gbrain doctor detect a live supervisor via the queue-scoped DB lock when the $HOME-derived pidfile is absent. Keyed on lock freshness, never a bare PID probe (PID-reuse-safe) (#2227).
  4. fix: supervisor degrades to capped-backoff retry instead of permanently giving up at the soft crash budget; self-heals on a transient DB blip; hard ceiling via GBRAIN_SUPERVISOR_HARD_STOP_CRASHES (#1994/#2227, TODOS:92).
  5. feat: clamp per-tick fan-out to max(1, concurrency−1) (gated on a live supervisor) + a gbrain doctor warning on mismatch (#2194).
  6. feat: per-source failure cooldown (bounded exponential 10→120min), checked at dispatch AND claim time; a success clears it (#2194).
  7. feat: split the cycle: per-source jobs run only source-scoped phases; a single autopilot-global-maintenance job runs the brain-wide phases once per window (structural single-flight). This kills the 4→10GB RSS blowout that was the shared root cause of #2194 and #2227.
  8. fix: categorize the new doctor check + a nullable-engine guard.

Why the split (the load-bearing decision)

The plan first proposed single-flight-skip (concurrent cycles skip the global phases if another holds a lock). A codex outside-voice review during /plan-eng-review caught that skip-then-stamp-fresh poisons the only freshness gate — a source gets marked done while its new rows go unembedded. The split design (per-source phases + one global-maintenance job) is the codebase's documented Phase-2 design: no skip semantics, no freshness poisoning, single-flight by construction.

New config / surface

  • Jobs: autopilot-global-maintenance
  • Config: autopilot.failure_cooldown_min (0=off), autopilot.failure_cooldown_cap_min, autopilot.fanout_clamp_to_concurrency, autopilot.global_floor_min
  • Env: GBRAIN_SUPERVISOR_HARD_STOP_CRASHES
  • Doctor check: autopilot_fanout_concurrency
  • Source config field: last_source_cycle_at; brain config: autopilot.last_global_at

All on by default; existing brains pick them up on the next tick with no migration (one catch-up global pass on first tick).

Test plan

  • bun run typecheck — clean.
  • New/changed unit tests across the blast radius all pass (supervisor, child-worker-supervisor, db-lock, fanout, cooldown, global-maintenance, doctor, doctor-categories) — 438 tests green post-merge.
  • Engine-parity: new test/e2e/autopilot-cooldown-parity.test.ts (PGLite half local; Postgres half runs in CI).
  • Reviewed via /plan-eng-review + codex outside-voice (logged CLEAR).

🤖 Generated with Claude Code

garrytan and others added 13 commits June 16, 2026 22:36
#2227)

A duplicate supervisor loses the queue-scoped DB singleton lock (#1849) and
exits LOCK_HELD before spawning a worker or emitting 'started'. summarizeCrashes
counts only worker_exited, so the fence path is structurally uncountable. Pin it
so a future refactor that logs worker_exited on the fence path fails here instead
of silently re-introducing the crash-budget breaker-trip loop.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
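
The invariant this test pins can be sketched as follows. This is a minimal illustration, not the repository's `summarizeCrashes`: the event shape, the `countCrashes` helper, and the `lock_held_exit` event name are assumptions made for the example; only `worker_exited` and `started` come from the commit message.

```typescript
// Hypothetical audit-event shape; the real type is not shown in this PR.
type SupervisorEvent = { kind: string; at: number };

// Stand-in for summarizeCrashes: only worker_exited events count toward the crash budget.
function countCrashes(events: SupervisorEvent[]): number {
  return events.filter((e) => e.kind === "worker_exited").length;
}

// A duplicate supervisor exits on the lock fence before 'started' or any worker spawn,
// so its trail has no worker_exited entry ("lock_held_exit" is an assumed name).
const fenceExit: SupervisorEvent[] = [{ kind: "lock_held_exit", at: 1 }];

// A genuine worker crash, for contrast.
const realCrash: SupervisorEvent[] = [
  { kind: "started", at: 1 },
  { kind: "worker_exited", at: 2 },
];

console.log(countCrashes(fenceExit)); // 0
console.log(countCrashes(realCrash)); // 1
```

A refactor that logged `worker_exited` on the fence path would make the first count non-zero, which is the regression the pinned test is meant to catch.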
…, not global repo (#2194 #2227)

A per-source autopilot-cycle inherited the global sync.repo_path as brainDir while
stamping DB freshness for source_id — mixed scope. FS phases (sync/lint/extract)
ran against the wrong tree, so the failure-cooldown and freshness gates would
attribute work to the wrong source. Resolve the source's local_path in the handler
(reuse the archive-recheck SELECT) and bind brainDir to it; a pure-DB source gets
null (FS phases skip) instead of falling through to the global checkout. Legacy
no-source dispatch keeps the global repoPath. Prerequisite for the cooldown/split
commits (codex outside-voice #8). Resolves TODOS:634.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
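
A rough sketch of the binding rule described in this commit, assuming a hypothetical `SourceRow` shape and a hypothetical `resolveBrainDir` helper (the real handler reuses the archive-recheck SELECT, which is not shown here):

```typescript
// Hypothetical row shape; only `local_path` is named in the commit message.
type SourceRow = { id: string; local_path: string | null };

// Returns the directory the filesystem phases should run against,
// or null when the FS phases should be skipped.
function resolveBrainDir(source: SourceRow | null, globalRepoPath: string): string | null {
  // Legacy no-source dispatch keeps the global repo path.
  if (source === null) return globalRepoPath;
  // Per-source dispatch binds to the source's own tree. A pure-DB source has no
  // local_path, so it yields null instead of falling through to the global checkout.
  return source.local_path;
}

console.log(resolveBrainDir({ id: "notes", local_path: "/srv/notes" }, "/srv/global")); // /srv/notes
console.log(resolveBrainDir({ id: "db-only", local_path: null }, "/srv/global")); // null
console.log(resolveBrainDir(null, "/srv/global")); // /srv/global
```

The point of returning null rather than the global path is that FS work and the DB freshness stamp are always attributed to the same source.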
… $HOME (#2227)

jobs supervisor status + doctor read the HOME-derived pidfile, so a supervisor
started under a different $HOME (keeper=/root vs ops=/data) read as 'not running'
while healthy — the false signal that drives an operator to spawn a duplicate.
Both surfaces now fall back to the queue-scoped DB singleton lock (#1849), the
HOME-independent authority, when the pidfile shows nothing. New isLockHolderLive
keys on lock freshness (ttl + heartbeat steal-grace), never process.kill, so PID
reuse can't false-positive (pid-liveness-alone-pid-reuse). Status surfaces the
holder host/pid + recorded concurrency/max-rss from the latest started event.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
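
The freshness-keyed liveness check might look roughly like this. It is a sketch under stated assumptions: the `LockRow` fields, the function name `isLockHolderLiveSketch`, and the reading of "ttl + heartbeat steal-grace" as a simple sum are all illustrative, not the actual `isLockHolderLive` in db-lock.ts.

```typescript
// Hypothetical lock-row shape; field names are assumptions for this example.
type LockRow = { holderHost: string; holderPid: number; heartbeatAtMs: number };

// Liveness is decided purely from how recently the holder refreshed the lock.
// No PID probe is involved, so a recycled PID cannot produce a false positive.
function isLockHolderLiveSketch(
  lock: LockRow | null,
  nowMs: number,
  ttlMs: number,
  stealGraceMs: number,
): boolean {
  if (lock === null) return false; // nobody holds the lock
  const ageMs = nowMs - lock.heartbeatAtMs;
  return ageMs <= ttlMs + stealGraceMs;
}

const lock: LockRow = { holderHost: "keeper", holderPid: 4242, heartbeatAtMs: 100_000 };
console.log(isLockHolderLiveSketch(lock, 130_000, 30_000, 10_000)); // true  (30s old, 40s allowed)
console.log(isLockHolderLiveSketch(lock, 150_001, 30_000, 10_000)); // false (stale heartbeat)
console.log(isLockHolderLiveSketch(null, 130_000, 30_000, 10_000)); // false (no holder)
```

Because the check reads only the lock row, it gives the same answer regardless of which $HOME the caller runs under.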
… storm (#1994 #2227)

max_crashes_exceeded gave up forever, so a transient DB-pooler blip that tripped
the soft budget wedged the queue until a human restart (#2227's breaker-trips tail).
Crossing the soft budget now enters degraded mode: keep respawning with capped
exponential backoff (60s cap — a paced retry, not a hot loop) and emit a loud
crash_budget_degraded health_warn. The existing stable-run reset clears the count
once a respawn survives >5min, so a recovered DB self-heals. Permanent give-up
fires only at a much-higher hard ceiling (maxCrashes × 10), tunable/disablable via
GBRAIN_SUPERVISOR_HARD_STOP_CRASHES (0 = never). Resolves TODOS:92.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
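
The degraded-retry policy can be illustrated with a small decision function. This is a sketch, not the supervisor's code: the function names, the 1s base delay, the doubling factor, the `>=` comparisons, and the handling of unparsable env values are assumptions. The 60s cap, the maxCrashes × 10 default ceiling, and "0 = never" come from the commit message.

```typescript
type Decision =
  | { action: "respawn"; delayMs: number; degraded: boolean }
  | { action: "give_up" };

// Resolve the hard ceiling from the GBRAIN_SUPERVISOR_HARD_STOP_CRASHES value.
// Assumption: unset or unparsable values fall back to the default of maxCrashes × 10.
function resolveHardStop(maxCrashes: number, envValue: string | undefined): number {
  if (envValue === undefined || envValue.trim() === "") return maxCrashes * 10;
  const n = Number(envValue);
  if (!Number.isInteger(n) || n < 0) return maxCrashes * 10;
  return n; // 0 means "never give up permanently"
}

function nextAction(
  crashCount: number,
  maxCrashes: number,
  hardStop: number,
  baseMs = 1_000, // assumed base delay
  capMs = 60_000, // 60s cap from the commit message
): Decision {
  if (hardStop > 0 && crashCount >= hardStop) return { action: "give_up" };
  // Capped exponential backoff: base, 2×base, 4×base, ... up to capMs.
  const delayMs = Math.min(capMs, baseMs * 2 ** Math.max(0, crashCount - 1));
  // At or past the soft budget the supervisor keeps respawning, but in degraded mode.
  return { action: "respawn", delayMs, degraded: crashCount >= maxCrashes };
}

console.log(nextAction(1, 5, 50)); // respawn after 1000ms, not degraded
console.log(nextAction(10, 5, 50)); // respawn after 60000ms (capped), degraded
console.log(nextAction(50, 5, 50)); // give_up at the hard ceiling
```

Resetting `crashCount` after a stable run (the existing >5min reset) is what lets a recovered DB bring the supervisor back out of degraded mode.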
…#2194)

Fan-out resolved to 4 (Postgres) regardless of worker --concurrency, so surplus
cycles queued behind the worker and raced the stalled-sweeper. Two fixes for the
same mismatch:
- resolveEffectiveFanoutMax clamps to max(1, concurrency-1) (reserve a slot),
  gated on a LIVE DB-lock holder so a stale started-audit row can't shrink
  throughput (codex #9/D5); no live holder → unknown → unclamped base. Escape
  hatch autopilot.fanout_clamp_to_concurrency.
- doctor's autopilot_fanout_concurrency check warns when fan-out exceeds
  effective slots — the misconfig was silent before. Advisory (started-event
  concurrency), wired into both doctor surfaces.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
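
The clamp rule can be sketched as a pure function. This is an illustration, not the actual `resolveEffectiveFanoutMax`: the name `effectiveFanoutMax`, its parameter list, and the use of `null` for "no live lock holder" are assumptions, as is reading "clamps to" as an upper bound on the base fan-out.

```typescript
// base:            the unclamped fan-out (4 on Postgres, per the commit message)
// liveConcurrency: worker concurrency recorded by a LIVE lock holder, or null when
//                  no live holder exists (unknown, so the base is left unclamped)
// clampEnabled:    stands in for the autopilot.fanout_clamp_to_concurrency escape hatch
function effectiveFanoutMax(
  base: number,
  liveConcurrency: number | null,
  clampEnabled = true,
): number {
  if (!clampEnabled || liveConcurrency === null) return base;
  // Reserve one worker slot, but never clamp below a fan-out of 1.
  return Math.min(base, Math.max(1, liveConcurrency - 1));
}

console.log(effectiveFanoutMax(4, 2)); // 1
console.log(effectiveFanoutMax(4, 4)); // 3
console.log(effectiveFanoutMax(4, 8)); // 4 (base already fits)
console.log(effectiveFanoutMax(4, null)); // 4 (no live holder)
```

Gating on a live holder matters because a stale started-audit row would otherwise shrink throughput based on a supervisor that no longer exists.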
…rm (#2194)

Only SUCCESS gated dispatch, so a source whose cycle kept failing/timing-out
re-fanned-out every 5-min tick forever (200+ dead jobs/24h). Now a failed source
backs off with bounded exponential cooldown (10→120min). Read at DISPATCH from
minion_jobs dead/failed rows (timeouts/RSS-kills dead-letter via SQL and never
run handler code, so a write-only hook would miss them) AND re-checked at CLAIM
time in the handler (codex #5: already-queued/retrying jobs). A success clears it
(codex #7); null-source rows excluded (codex #6); engine-parity via executeRaw.
Disable with autopilot.failure_cooldown_min=0. Fail-open if config/history reads
error. Surfaced via fanout_cooldown_skipped + the fanout summary.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
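
A minimal sketch of the bounded exponential cooldown, assuming a doubling schedule and hypothetical helper names (`trailingFailures`, `cooldownMinutes`, `isCoolingDown`). The 10 and 120 minute bounds and the "0 = off" rule come from the commit message; the `completed` status name and the newest-first history shape are assumptions, and the real dispatch check reads minion_jobs via SQL rather than an in-memory array.

```typescript
type JobOutcome = "completed" | "failed" | "dead";

// Count failures since the most recent success; a success clears the cooldown.
function trailingFailures(outcomesNewestFirst: JobOutcome[]): number {
  let n = 0;
  for (const outcome of outcomesNewestFirst) {
    if (outcome === "completed") break;
    n++;
  }
  return n;
}

// Cooldown in minutes: base, 2×base, 4×base, ... bounded by capMin.
// baseMin = 0 disables the cooldown (autopilot.failure_cooldown_min=0).
function cooldownMinutes(failures: number, baseMin = 10, capMin = 120): number {
  if (baseMin <= 0 || failures <= 0) return 0;
  return Math.min(capMin, baseMin * 2 ** (failures - 1));
}

// Used at both dispatch and claim time in this sketch.
function isCoolingDown(
  lastFailureAtMs: number | null,
  failures: number,
  nowMs: number,
  baseMin = 10,
  capMin = 120,
): boolean {
  if (lastFailureAtMs === null) return false;
  return nowMs - lastFailureAtMs < cooldownMinutes(failures, baseMin, capMin) * 60_000;
}

console.log([1, 2, 3, 4, 5].map((n) => cooldownMinutes(n))); // [ 10, 20, 40, 80, 120 ]
console.log(trailingFailures(["dead", "failed", "completed", "failed"])); // 2
```

With a 5-minute tick, even the first 10-minute cooldown skips at least one dispatch for a failing source instead of re-fanning it out every tick.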
…ntenance job (#2194 #2227)

N per-source cycles each ran the brain-wide global phases (embed-all/orphans/
purge/…) concurrently, thrashing the same rows and taking the worker 4→10GB in
<60s → RSS-kill → orphaned stalls. Split them: per-source jobs now run only
source-scoped (+ mixed) phases and stamp last_source_cycle_at; a new
autopilot-global-maintenance job runs the global phases ONCE per window
(idempotency_key + maxWaiting:1 = structural single-flight) and stamps
autopilot.last_global_at. This is the codex-endorsed design that replaced the
rejected skip-and-stamp-fresh approach (codex #1/#2): no freshness poisoning, no
starvation — global work always runs as its own job, never marked done when it
wasn't. PHASE_SCOPE is now a runtime partition (GLOBAL ∪ NON_GLOBAL == ALL).
last_full_cycle_at still written for doctor/legacy (no longer a global gate).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
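
The partition invariant (GLOBAL ∪ NON_GLOBAL == ALL, with no overlap) can be checked at runtime along these lines. The phase names are taken from the commit messages, but the scope assigned to each phase here and the `isPartition` helper are assumptions for illustration; the real `PHASE_SCOPE` in cycle.ts contains more phases, including mixed ones.

```typescript
// Illustrative scope table; the actual assignments live in cycle.ts.
const PHASE_SCOPE = {
  sync: "source",
  lint: "source",
  extract: "source",
  "embed-all": "global",
  orphans: "global",
  purge: "global",
} as const;

type Phase = keyof typeof PHASE_SCOPE;

const ALL = Object.keys(PHASE_SCOPE) as Phase[];
const GLOBAL = ALL.filter((p) => PHASE_SCOPE[p] === "global");
const NON_GLOBAL = ALL.filter((p) => PHASE_SCOPE[p] !== "global");

// True when a and b are disjoint and together cover every element of all.
function isPartition<T>(all: readonly T[], a: readonly T[], b: readonly T[]): boolean {
  const seen = new Set<T>();
  for (const x of [...a, ...b]) {
    if (seen.has(x)) return false; // a phase listed twice would run in both job types
    seen.add(x);
  }
  return seen.size === all.length && all.every((x) => seen.has(x));
}

console.log(isPartition(ALL, GLOBAL, NON_GLOBAL)); // true
console.log(GLOBAL); // [ 'embed-all', 'orphans', 'purge' ]
```

Deriving both sets from one table means a newly added phase lands in exactly one job type, so global work is never silently dropped or run N times.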
)

Follow-up to the supervisor-visibility commit: doctor's engine binding is
BrainEngine | null, so the inspectLock fallback must guard on a non-null engine
(tsc TS2345). No behavior change — a null engine simply skips the DB-lock probe
and falls back to the pidfile reading, as before.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
)

Follow-up to the fan-out/concurrency commit: the doctor-categories drift guard
requires every check name in doctor.ts to belong to exactly one category set.
Add the new autopilot_fanout_concurrency check to OPS_CHECK_NAMES (infrastructure
liveness, alongside wedged_queue/supervisor).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… wedge (#2194 #2227 #1994)

Split the per-source cycle (source phases fan out; one global-maintenance job runs
the brain-wide phases once) to end the 4→10GB RSS blowout; add per-source failure
cooldown, fan-out-to-concurrency clamp + doctor warning, supervisor degraded-retry
instead of permanent give-up, and DB-lock supervisor visibility under split $HOME.
Resolves TODOS:634 (repoPath) and TODOS:92 (#1994 backoff).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…raded-retry (#2194 #2227)

Post-ship document-release: refresh the KEY_FILES current-state entries that
drifted — cycle.ts (GLOBAL/NON_GLOBAL phase split + last_source_cycle_at /
autopilot.last_global_at), jobs.ts (per-source local_path brainDir, claim-time
cooldown, autopilot-global-maintenance handler), supervisor.ts + child-worker
(degraded retry instead of permanent give-up; hard ceiling), db-lock.ts
(isLockHolderLive), handler-timeouts (new handler). Regenerated llms bundle.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-wedge-fix

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
@garrytan

Owner Author

Superseded by #2287. This branch's 8 functional commits + the KEY_FILES doc update were replayed commit-for-commit onto current master in #2287 (re-versioned v0.42.50.0 → v0.42.52.0, since 0.42.50.0/0.42.51.0 were taken by #2254/#2255), bundled with four adjacent reliability fixes (#1737 #1738 #1950 #1984). #2287 closes #2194/#2227/#1994. Closing this in favor of #2287.

@garrytan garrytan closed this Jun 18, 2026
garrytan added a commit that referenced this pull request Jun 18, 2026
…2194)

The cherry-picked autopilot-fanout-clamp + doctor-autopilot-fanout-concurrency
tests mutate process.env.GBRAIN_AUDIT_DIR in beforeEach/afterEach, which the
check:test-isolation R1 lint flags (parallel shards load multiple files per
process). Rename to *.serial.test.ts (sanctioned quarantine — they run under
--max-concurrency=1) instead of restructuring the reviewed test bodies. No logic
change; both files stay green (9 tests). Fixes the failing verify CI check.