v0.42.52.0 fix(reliability): autopilot dead-job storm + supervisor wedge + sync/status/minion reliability (#2194 #2227 #1994 #1737 #1738 #1950 #1984) by garrytan · Pull Request #2287 · garrytan/gbrain

garrytan · 2026-06-18T18:04:50Z

Summary

One reliability fix wave, scoped from the open garrytan-agents issue backlog. It lands the already-reviewed autopilot/supervisor stabilization (previously stranded as an unmerged PR) plus four bounded operational fixes found alongside.

On a multi-source Postgres brain, autopilot could fan out a continuous stream of dead autopilot-cycle jobs while the supervisor periodically wedged the very queue it exists to keep alive — one disease with several interacting parts. This addresses all of them, then cleans up four smaller reliability bugs.

What's in the wave

Autopilot / supervisor (Closes #2194, #2227, #1994) — replayed from the reviewed work, commit-for-commit:

Cycle split: per-source jobs run only source-scoped phases; a single autopilot-global-maintenance job runs brain-wide phases once per window — removes the per-cycle memory blow-up that was the shared root cause.
Per-source filesystem phases bind to the source's own path (freshness stamp and work agree on which source ran).
Supervisor degrades to capped-backoff retry instead of a permanent stop on a transient DB blip; detects a live sibling supervisor via the queue's DB lock (not a $HOME-derived pidfile).
Per-source failure cooldown (bounded exponential) + per-tick fan-out clamp + a gbrain doctor mismatch check.

Operational fixes:

Minions: long-running jobs (subagent/embed-backfill/autopilot-cycle) thrash on wall-clock timeout with broken attempt accounting, starving the queue #1737 — a timed-out minion run now counts as a spent attempt (handleTimeouts was the one dead-letter path that didn't increment, so long-lane jobs read attempts: 0/N). (Attempt-accounting half; the queue-starvation half stays a measurement-gated v0.43+ follow-up — see TODOS.)
gbrain agent run: --detach / flags after the prompt get swallowed into the prompt string #1738 (Closes) — gbrain agent run no longer swallows a trailing --detach/--follow into the prompt; a --word inside the prompt stays verbatim; -- ends flag parsing anywhere.
sync --source default wedged ~29min holding gbrain-sync lock; status reports 'idle' while proc live; no way to reap without pkill #1950 — gbrain sources status reports a live sync as running (not "idle") via a shared liveSyncStatus helper; a progress-aware stall watchdog aborts a wedged drain and releases its lock so the next sync resumes from the checkpoint (no manual pkill). (In-flight single-file abort is a documented follow-up in TODOS.)
Operational health pollers blow their budget: every CLI call has a ~5.5s cold-start+connect floor and there is no consolidated 'gbrain health' command #1984 (Closes) — gbrain status gains a version field and a --deadline-ms/--fast per-section budget that returns a partial envelope instead of hanging a poller.

Pre-landing review

/plan-eng-review (3 deep-review agents + codex outside voice) hardened the design; 4 decisions resolved with the author.
Ship-stage adversarial pass (codex + a 9-agent diff-review workflow) found 6 real issues in the new code — all fixed before this push: stall-abort reason now distinct (stall_timeout), --deadline-ms missing-value is a usage error, thin-client stale-section reporting scoped to requested sections + losing remote call cancelled, and the -- escape now suppresses hoisting after a positional. Each fix has a regression test.

Test plan

Full sharded unit suite on the final code: pass=14036, fail=0 (the one transient failure was a VERSION/package.json torn read from bumping the version mid-run; re-confirmed green with consistent VERSION — test/cli.test.ts 18/18).
bun run typecheck clean. Targeted suites green: v0.42.50.0 fix(autopilot): kill the dead-job storm + supervisor queue wedge (#2194 #2227 #1994) #2249 (108), agent-cli + status-sections (52), cli (18), minions handleTimeouts.
No migration added; master's v118/v119 (v0.42.51.0 fix(sync): contention-free clock + checkpoint integrity + honest sync freshness #2255) preserved. Engine-parity intact (autopilot-cooldown parity test travels with the wave).

🤖 Generated with Claude Code

#2227) A duplicate supervisor loses the queue-scoped DB singleton lock (#1849) and exits LOCK_HELD before spawning a worker or emitting 'started'. summarizeCrashes counts only worker_exited, so the fence path is structurally uncountable. Pin it so a future refactor that logs worker_exited on the fence path fails here instead of silently re-introducing the crash-budget breaker-trip loop. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…, not global repo (#2194 #2227) A per-source autopilot-cycle inherited the global sync.repo_path as brainDir while stamping DB freshness for source_id — mixed scope. FS phases (sync/lint/extract) ran against the wrong tree, so the failure-cooldown and freshness gates would attribute work to the wrong source. Resolve the source's local_path in the handler (reuse the archive-recheck SELECT) and bind brainDir to it; a pure-DB source gets null (FS phases skip) instead of falling through to the global checkout. Legacy no-source dispatch keeps the global repoPath. Prerequisite for the cooldown/split commits (codex outside-voice #8). Resolves TODOS:634. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… $HOME (#2227) jobs supervisor status + doctor read the HOME-derived pidfile, so a supervisor started under a different $HOME (keeper=/root vs ops=/data) read as 'not running' while healthy — the false signal that drives an operator to spawn a duplicate. Both surfaces now fall back to the queue-scoped DB singleton lock (#1849), the HOME-independent authority, when the pidfile shows nothing. New isLockHolderLive keys on lock freshness (ttl + heartbeat steal-grace), never process.kill, so PID reuse can't false-positive (pid-liveness-alone-pid-reuse). Status surfaces the holder host/pid + recorded concurrency/max-rss from the latest started event. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… storm (#1994 #2227) max_crashes_exceeded gave up forever, so a transient DB-pooler blip that tripped the soft budget wedged the queue until a human restart (#2227's breaker-trips tail). Crossing the soft budget now enters degraded mode: keep respawning with capped exponential backoff (60s cap — a paced retry, not a hot loop) and emit a loud crash_budget_degraded health_warn. The existing stable-run reset clears the count once a respawn survives >5min, so a recovered DB self-heals. Permanent give-up fires only at a much-higher hard ceiling (maxCrashes × 10), tunable/disablable via GBRAIN_SUPERVISOR_HARD_STOP_CRASHES (0 = never). Resolves TODOS:92. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…#2194) Fan-out resolved to 4 (Postgres) regardless of worker --concurrency, so surplus cycles queued behind the worker and raced the stalled-sweeper. Two fixes for the same mismatch: - resolveEffectiveFanoutMax clamps to max(1, concurrency-1) (reserve a slot), gated on a LIVE DB-lock holder so a stale started-audit row can't shrink throughput (codex #9/D5); no live holder → unknown → unclamped base. Escape hatch autopilot.fanout_clamp_to_concurrency. - doctor's autopilot_fanout_concurrency check warns when fan-out exceeds effective slots — the misconfig was silent before. Advisory (started-event concurrency), wired into both doctor surfaces. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rm (#2194) Only SUCCESS gated dispatch, so a source whose cycle kept failing/timing-out re-fanned-out every 5-min tick forever (200+ dead jobs/24h). Now a failed source backs off with bounded exponential cooldown (10→120min). Read at DISPATCH from minion_jobs dead/failed rows (timeouts/RSS-kills dead-letter via SQL and never run handler code, so a write-only hook would miss them) AND re-checked at CLAIM time in the handler (codex #5: already-queued/retrying jobs). A success clears it (codex #7); null-source rows excluded (codex #6); engine-parity via executeRaw. Disable with autopilot.failure_cooldown_min=0. Fail-open if config/history reads error. Surfaced via fanout_cooldown_skipped + the fanout summary. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ntenance job (#2194 #2227) N per-source cycles each ran the brain-wide global phases (embed-all/orphans/ purge/…) concurrently, thrashing the same rows and taking the worker 4→10GB in <60s → RSS-kill → orphaned stalls. Split them: per-source jobs now run only source-scoped (+ mixed) phases and stamp last_source_cycle_at; a new autopilot-global-maintenance job runs the global phases ONCE per window (idempotency_key + maxWaiting:1 = structural single-flight) and stamps autopilot.last_global_at. This is the codex-endorsed design that replaced the rejected skip-and-stamp-fresh approach (codex #1/#2): no freshness poisoning, no starvation — global work always runs as its own job, never marked done when it wasn't. PHASE_SCOPE is now a runtime partition (GLOBAL ∪ NON_GLOBAL == ALL). last_full_cycle_at still written for doctor/legacy (no longer a global gate). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

) Follow-up to the supervisor-visibility commit: doctor's engine binding is BrainEngine | null, so the inspectLock fallback must guard on a non-null engine (tsc TS2345). No behavior change — a null engine simply skips the DB-lock probe and falls back to the pidfile reading, as before. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

) Follow-up to the fan-out/concurrency commit: the doctor-categories drift guard requires every check name in doctor.ts to belong to exactly one category set. Add the new autopilot_fanout_concurrency check to OPS_CHECK_NAMES (infrastructure liveness, alongside wedged_queue/supervisor). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…raded-retry (#2194 #2227) Post-ship document-release: refresh the KEY_FILES current-state entries that drifted — cycle.ts (GLOBAL/NON_GLOBAL phase split + last_source_cycle_at / autopilot.last_global_at), jobs.ts (per-source local_path brainDir, claim-time cooldown, autopilot-global-maintenance handler), supervisor.ts + child-worker (degraded retry instead of permanent give-up; hard ceiling), db-lock.ts (isLockHolderLive), handler-timeouts (new handler). Regenerated llms bundle. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…mpt (#1737) The per-job timeout_at dead-letter (handleTimeouts) set status='dead' without incrementing attempts_made, unlike the wall-clock and stall dead-letter siblings. It is the FIRST killer to fire for the long-lane handlers (subagent / embed-backfill / autopilot-cycle) because timeout_ms is stamped at submit, so a timed-out long job reported `attempts: 0/N (started: N)`. Mirror the siblings with attempts_made + 1 (terminal, no retry). Safe against double-count: the worker sweep runs handleStalled -> handleTimeouts -> handleWallClockTimeouts sequentially and awaited, each guarded on status='active', so the first to dead-letter excludes the row from the rest. Regression assertions added (test/minions.test.ts + e2e/minions-resilience.test.ts) so the increment can't be silently dropped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…freeform (#1738) parseRunFlags() broke flag parsing at the first positional token, so any flag after the prompt (`gbrain agent run "do X" --detach`) was swallowed into the prompt string and silently ignored. Now the no-value switches --detach/--follow/ --no-follow are hoisted when they trail the prompt, while everything else stays verbatim: an unknown --word is treated as prompt text (no "unknown flag" throw), a --switch mid-prompt is preserved, and `--` suppresses hoisting entirely for a literal escape. Value-flags now reject a missing or flag-shaped value (and --max-turns/--timeout-ms a non-number) instead of capturing undefined/NaN. Contract change: a prompt that starts with or trails an unguarded --word no longer errors; a literal trailing --detach needs `--`. Help text updated; tests revised + extended (test/agent-cli.test.ts). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Finishes the #2255 honest-freshness story for two gaps it left. (a) `gbrain sources status` printed "idle" while a sync proc held the per-source lock (the reported bug). New shared liveSyncStatus() helper in db-lock.ts reads the SAME live-lock signal `gbrain doctor` uses; runStatus now shows "running" (BACKFILL column + a sync_running field in --json) and suppresses the misleading "never synced" warning while a sync is live. One helper, so the surfaces can't drift (doctor/status retrofit tracked as a follow-up). (b) A sync wedged-but-alive kept refreshing its lock heartbeat (it fires on its own timer) and hadn't hit the wall-clock deadline, so only a manual pkill freed it. New in-band stall watchdog keys off FORWARD IMPORT PROGRESS (progress.tick), not the heartbeat: if no file completes for GBRAIN_SYNC_STALL_ABORT_SECONDS (default 900s), it aborts via a controller composed into opts.signal, so the drain returns partial() (last_commit unchanged, next run resumes from the checkpoint) and withRefreshingLock releases the lock. Limits, documented in code: a single file slower than the window trips it; a fully starved event loop won't fire the timer (the wall-clock hard deadline is that backstop). Tests: liveSyncStatus (live/expired/none/per-source) in db-lock-inspect; the resolveStallAbortSeconds env matrix in sync-hard-deadline. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`gbrain status` had no version in its JSON envelope and could hang on a slow connection with no way to get a partial answer. Two additions: - version: the StatusReport JSON now carries the local gbrain CLI version so a poller can pin behavior to a build. Thin-client also surfaces remote_version (the brain server's version), and the get_status_snapshot MCP op reports its version for that parity. - --deadline-ms=N / --fast: a shared wall-clock budget. Each section is raced against the REMAINING budget via Promise.race (NOT process-watchdog, which SIGKILLs and can't return partial output), so one slow/hung section can't strand the snapshot — it's marked stale and the rest still return. The envelope gains partial:true + stale_sections[]; exit code stays 0 (a snapshot was produced). Invalid --deadline-ms → exit 2. Tests: parseDeadlineFlag + withSectionDeadline (hermetic), the usage-error exit, version presence in the PGLite envelope, and the op's version key. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…#1950) Pre-landing review (codex + adversarial): the stall watchdog aborted opts.signal but the per-iteration abort checks returned partial('timeout'), collapsing a wedge-reap into a user --timeout/SIGINT so JSON consumers couldn't tell them apart. Add a 'stall_timeout' reason (set via a stallAborted flag) on the three import-loop abort sites; deletes/renames-phase and checkpoint sites stay 'timeout'. Sharpen the watchdog comment: the abort is observed BETWEEN files, so a hang inside a single importFile is not interrupted until it returns (TODO: thread a cancellation signal through importFile).

…1738) Pre-landing review: the leading-flag loop breaks at the first positional, so the `escaped` flag only fired for a leading `--`. A `--` placed after a positional left trailing-switch hoisting active, so `agent run note -- body --detach` silently detached and dropped the `--` as junk. Suppress hoisting whenever a literal `--` appears in the prompt. Regression test added.

… losing remote call (#1984) Pre-landing review (codex): (1) bare `--deadline-ms` with no value silently fell through to no-budget/--fast instead of a usage error; (2) thin-client timeout reported both sync+cycle stale even under `--section sync`, naming a section the caller excluded (local path was already correct); (3) the section race abandoned the remote promise locally but didn't cancel the in-flight MCP call — pass the budget as timeoutMs so the losing side actually cancels. Regression test added.

…dge + sync/status/minion reliability (#2194 #2227 #1994 #1737 #1738 #1950 #1984) Bundles the already-reviewed autopilot/supervisor stabilization (#2194 #2227 #1994: cycle split, per-source failure cooldown, fan-out clamp, degraded supervisor retry, DB-lock live-supervisor detection) with four operational fixes: minion timeout attempt-accounting (#1737), agent-run trailing-flag parsing (#1738), honest live-sync sources status + progress-aware stall watchdog (#1950), and status version + --deadline-ms partial result (#1984). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Post-ship doc sync (/document-release): add the sync stall watchdog env var to the CLAUDE.md sync-tuning table (Five → Six knobs) + regenerate the llms bundle.

…2194) The cherry-picked autopilot-fanout-clamp + doctor-autopilot-fanout-concurrency tests mutate process.env.GBRAIN_AUDIT_DIR in beforeEach/afterEach, which the check:test-isolation R1 lint flags (parallel shards load multiple files per process). Rename to *.serial.test.ts (sanctioned quarantine — they run under --max-concurrency=1) instead of restructuring the reviewed test bodies. No logic change; both files stay green (9 tests). Fixes the failing verify CI check.

garrytan and others added 18 commits June 18, 2026 00:55

garrytan mentioned this pull request Jun 18, 2026

v0.42.50.0 fix(autopilot): kill the dead-job storm + supervisor queue wedge (#2194 #2227 #1994) #2249

Closed

garrytan added 2 commits June 18, 2026 11:06

docs: document GBRAIN_SYNC_STALL_ABORT_SECONDS env knob (#1950)

d778312

Post-ship doc sync (/document-release): add the sync stall watchdog env var to the CLAUDE.md sync-tuning table (Five → Six knobs) + regenerate the llms bundle.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.42.52.0 fix(reliability): autopilot dead-job storm + supervisor wedge + sync/status/minion reliability (#2194 #2227 #1994 #1737 #1738 #1950 #1984)#2287

v0.42.52.0 fix(reliability): autopilot dead-job storm + supervisor wedge + sync/status/minion reliability (#2194 #2227 #1994 #1737 #1738 #1950 #1984)#2287
garrytan wants to merge 20 commits into
masterfrom
garrytan/check-outstanding-issues

garrytan commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

garrytan commented Jun 18, 2026

Summary

What's in the wave

Pre-landing review

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant