Skip to content

v0.42.52.0 fix(reliability): autopilot dead-job storm + supervisor wedge + sync/status/minion reliability (#2194 #2227 #1994 #1737 #1738 #1950 #1984)#2287

Open
garrytan wants to merge 20 commits into
masterfrom
garrytan/check-outstanding-issues
Open

v0.42.52.0 fix(reliability): autopilot dead-job storm + supervisor wedge + sync/status/minion reliability (#2194 #2227 #1994 #1737 #1738 #1950 #1984)#2287
garrytan wants to merge 20 commits into
masterfrom
garrytan/check-outstanding-issues

Conversation

@garrytan

Copy link
Copy Markdown
Owner

Summary

One reliability fix wave, scoped from the open garrytan-agents issue backlog. It lands the already-reviewed autopilot/supervisor stabilization (previously stranded as an unmerged PR) plus four bounded operational fixes found alongside.

On a multi-source Postgres brain, autopilot could fan out a continuous stream of dead autopilot-cycle jobs while the supervisor periodically wedged the very queue it exists to keep alive — one disease with several interacting parts. This addresses all of them, then cleans up four smaller reliability bugs.

What's in the wave

Autopilot / supervisor (Closes #2194, #2227, #1994) — replayed from the reviewed work, commit-for-commit:

  • Cycle split: per-source jobs run only source-scoped phases; a single autopilot-global-maintenance job runs brain-wide phases once per window — removes the per-cycle memory blow-up that was the shared root cause.
  • Per-source filesystem phases bind to the source's own path (freshness stamp and work agree on which source ran).
  • Supervisor degrades to capped-backoff retry instead of a permanent stop on a transient DB blip; detects a live sibling supervisor via the queue's DB lock (not a $HOME-derived pidfile).
  • Per-source failure cooldown (bounded exponential) + per-tick fan-out clamp + a gbrain doctor mismatch check.

Operational fixes:

Pre-landing review

  • /plan-eng-review (3 deep-review agents + codex outside voice) hardened the design; 4 decisions resolved with the author.
  • Ship-stage adversarial pass (codex + a 9-agent diff-review workflow) found 6 real issues in the new code — all fixed before this push: stall-abort reason now distinct (stall_timeout), --deadline-ms missing-value is a usage error, thin-client stale-section reporting scoped to requested sections + losing remote call cancelled, and the -- escape now suppresses hoisting after a positional. Each fix has a regression test.

Test plan

🤖 Generated with Claude Code

garrytan and others added 18 commits June 18, 2026 00:55
#2227)

A duplicate supervisor loses the queue-scoped DB singleton lock (#1849) and
exits LOCK_HELD before spawning a worker or emitting 'started'. summarizeCrashes
counts only worker_exited, so the fence path is structurally uncountable. Pin it
so a future refactor that logs worker_exited on the fence path fails here instead
of silently re-introducing the crash-budget breaker-trip loop.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…, not global repo (#2194 #2227)

A per-source autopilot-cycle inherited the global sync.repo_path as brainDir while
stamping DB freshness for source_id — mixed scope. FS phases (sync/lint/extract)
ran against the wrong tree, so the failure-cooldown and freshness gates would
attribute work to the wrong source. Resolve the source's local_path in the handler
(reuse the archive-recheck SELECT) and bind brainDir to it; a pure-DB source gets
null (FS phases skip) instead of falling through to the global checkout. Legacy
no-source dispatch keeps the global repoPath. Prerequisite for the cooldown/split
commits (codex outside-voice #8). Resolves TODOS:634.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… $HOME (#2227)

jobs supervisor status + doctor read the HOME-derived pidfile, so a supervisor
started under a different $HOME (keeper=/root vs ops=/data) read as 'not running'
while healthy — the false signal that drives an operator to spawn a duplicate.
Both surfaces now fall back to the queue-scoped DB singleton lock (#1849), the
HOME-independent authority, when the pidfile shows nothing. New isLockHolderLive
keys on lock freshness (ttl + heartbeat steal-grace), never process.kill, so PID
reuse can't false-positive (pid-liveness-alone-pid-reuse). Status surfaces the
holder host/pid + recorded concurrency/max-rss from the latest started event.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… storm (#1994 #2227)

max_crashes_exceeded gave up forever, so a transient DB-pooler blip that tripped
the soft budget wedged the queue until a human restart (#2227's breaker-trips tail).
Crossing the soft budget now enters degraded mode: keep respawning with capped
exponential backoff (60s cap — a paced retry, not a hot loop) and emit a loud
crash_budget_degraded health_warn. The existing stable-run reset clears the count
once a respawn survives >5min, so a recovered DB self-heals. Permanent give-up
fires only at a much-higher hard ceiling (maxCrashes × 10), tunable/disablable via
GBRAIN_SUPERVISOR_HARD_STOP_CRASHES (0 = never). Resolves TODOS:92.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…#2194)

Fan-out resolved to 4 (Postgres) regardless of worker --concurrency, so surplus
cycles queued behind the worker and raced the stalled-sweeper. Two fixes for the
same mismatch:
- resolveEffectiveFanoutMax clamps to max(1, concurrency-1) (reserve a slot),
  gated on a LIVE DB-lock holder so a stale started-audit row can't shrink
  throughput (codex #9/D5); no live holder → unknown → unclamped base. Escape
  hatch autopilot.fanout_clamp_to_concurrency.
- doctor's autopilot_fanout_concurrency check warns when fan-out exceeds
  effective slots — the misconfig was silent before. Advisory (started-event
  concurrency), wired into both doctor surfaces.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rm (#2194)

Only SUCCESS gated dispatch, so a source whose cycle kept failing/timing-out
re-fanned-out every 5-min tick forever (200+ dead jobs/24h). Now a failed source
backs off with bounded exponential cooldown (10→120min). Read at DISPATCH from
minion_jobs dead/failed rows (timeouts/RSS-kills dead-letter via SQL and never
run handler code, so a write-only hook would miss them) AND re-checked at CLAIM
time in the handler (codex #5: already-queued/retrying jobs). A success clears it
(codex #7); null-source rows excluded (codex #6); engine-parity via executeRaw.
Disable with autopilot.failure_cooldown_min=0. Fail-open if config/history reads
error. Surfaced via fanout_cooldown_skipped + the fanout summary.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ntenance job (#2194 #2227)

N per-source cycles each ran the brain-wide global phases (embed-all/orphans/
purge/…) concurrently, thrashing the same rows and taking the worker 4→10GB in
<60s → RSS-kill → orphaned stalls. Split them: per-source jobs now run only
source-scoped (+ mixed) phases and stamp last_source_cycle_at; a new
autopilot-global-maintenance job runs the global phases ONCE per window
(idempotency_key + maxWaiting:1 = structural single-flight) and stamps
autopilot.last_global_at. This is the codex-endorsed design that replaced the
rejected skip-and-stamp-fresh approach (codex #1/#2): no freshness poisoning, no
starvation — global work always runs as its own job, never marked done when it
wasn't. PHASE_SCOPE is now a runtime partition (GLOBAL ∪ NON_GLOBAL == ALL).
last_full_cycle_at still written for doctor/legacy (no longer a global gate).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
)

Follow-up to the supervisor-visibility commit: doctor's engine binding is
BrainEngine | null, so the inspectLock fallback must guard on a non-null engine
(tsc TS2345). No behavior change — a null engine simply skips the DB-lock probe
and falls back to the pidfile reading, as before.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
)

Follow-up to the fan-out/concurrency commit: the doctor-categories drift guard
requires every check name in doctor.ts to belong to exactly one category set.
Add the new autopilot_fanout_concurrency check to OPS_CHECK_NAMES (infrastructure
liveness, alongside wedged_queue/supervisor).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…raded-retry (#2194 #2227)

Post-ship document-release: refresh the KEY_FILES current-state entries that
drifted — cycle.ts (GLOBAL/NON_GLOBAL phase split + last_source_cycle_at /
autopilot.last_global_at), jobs.ts (per-source local_path brainDir, claim-time
cooldown, autopilot-global-maintenance handler), supervisor.ts + child-worker
(degraded retry instead of permanent give-up; hard ceiling), db-lock.ts
(isLockHolderLive), handler-timeouts (new handler). Regenerated llms bundle.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mpt (#1737)

The per-job timeout_at dead-letter (handleTimeouts) set status='dead' without
incrementing attempts_made, unlike the wall-clock and stall dead-letter siblings.
It is the FIRST killer to fire for the long-lane handlers (subagent / embed-backfill
/ autopilot-cycle) because timeout_ms is stamped at submit, so a timed-out long job
reported `attempts: 0/N (started: N)`. Mirror the siblings with attempts_made + 1
(terminal, no retry). Safe against double-count: the worker sweep runs handleStalled
-> handleTimeouts -> handleWallClockTimeouts sequentially and awaited, each guarded on
status='active', so the first to dead-letter excludes the row from the rest.

Regression assertions added (test/minions.test.ts + e2e/minions-resilience.test.ts)
so the increment can't be silently dropped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…freeform (#1738)

parseRunFlags() broke flag parsing at the first positional token, so any flag
after the prompt (`gbrain agent run "do X" --detach`) was swallowed into the
prompt string and silently ignored. Now the no-value switches --detach/--follow/
--no-follow are hoisted when they trail the prompt, while everything else stays
verbatim: an unknown --word is treated as prompt text (no "unknown flag" throw),
a --switch mid-prompt is preserved, and `--` suppresses hoisting entirely for a
literal escape. Value-flags now reject a missing or flag-shaped value (and
--max-turns/--timeout-ms a non-number) instead of capturing undefined/NaN.

Contract change: a prompt that starts with or trails an unguarded --word no
longer errors; a literal trailing --detach needs `--`. Help text updated; tests
revised + extended (test/agent-cli.test.ts).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Finishes the #2255 honest-freshness story for two gaps it left.

(a) `gbrain sources status` printed "idle" while a sync proc held the per-source
lock (the reported bug). New shared liveSyncStatus() helper in db-lock.ts reads
the SAME live-lock signal `gbrain doctor` uses; runStatus now shows "running"
(BACKFILL column + a sync_running field in --json) and suppresses the misleading
"never synced" warning while a sync is live. One helper, so the surfaces can't
drift (doctor/status retrofit tracked as a follow-up).

(b) A sync wedged-but-alive kept refreshing its lock heartbeat (it fires on its
own timer) and hadn't hit the wall-clock deadline, so only a manual pkill freed
it. New in-band stall watchdog keys off FORWARD IMPORT PROGRESS (progress.tick),
not the heartbeat: if no file completes for GBRAIN_SYNC_STALL_ABORT_SECONDS
(default 900s), it aborts via a controller composed into opts.signal, so the
drain returns partial() (last_commit unchanged, next run resumes from the
checkpoint) and withRefreshingLock releases the lock. Limits, documented in
code: a single file slower than the window trips it; a fully starved event loop
won't fire the timer (the wall-clock hard deadline is that backstop).

Tests: liveSyncStatus (live/expired/none/per-source) in db-lock-inspect; the
resolveStallAbortSeconds env matrix in sync-hard-deadline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`gbrain status` had no version in its JSON envelope and could hang on a slow
connection with no way to get a partial answer. Two additions:

- version: the StatusReport JSON now carries the local gbrain CLI version so a
  poller can pin behavior to a build. Thin-client also surfaces remote_version
  (the brain server's version), and the get_status_snapshot MCP op reports its
  version for that parity.
- --deadline-ms=N / --fast: a shared wall-clock budget. Each section is raced
  against the REMAINING budget via Promise.race (NOT process-watchdog, which
  SIGKILLs and can't return partial output), so one slow/hung section can't
  strand the snapshot — it's marked stale and the rest still return. The
  envelope gains partial:true + stale_sections[]; exit code stays 0 (a snapshot
  was produced). Invalid --deadline-ms → exit 2.

Tests: parseDeadlineFlag + withSectionDeadline (hermetic), the usage-error exit,
version presence in the PGLite envelope, and the op's version key.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…#1950)

Pre-landing review (codex + adversarial): the stall watchdog aborted opts.signal
but the per-iteration abort checks returned partial('timeout'), collapsing a
wedge-reap into a user --timeout/SIGINT so JSON consumers couldn't tell them
apart. Add a 'stall_timeout' reason (set via a stallAborted flag) on the three
import-loop abort sites; deletes/renames-phase and checkpoint sites stay 'timeout'.
Sharpen the watchdog comment: the abort is observed BETWEEN files, so a hang
inside a single importFile is not interrupted until it returns (TODO: thread a
cancellation signal through importFile).
…1738)

Pre-landing review: the leading-flag loop breaks at the first positional, so the
`escaped` flag only fired for a leading `--`. A `--` placed after a positional
left trailing-switch hoisting active, so `agent run note -- body --detach`
silently detached and dropped the `--` as junk. Suppress hoisting whenever a
literal `--` appears in the prompt. Regression test added.
… losing remote call (#1984)

Pre-landing review (codex): (1) bare `--deadline-ms` with no value silently fell
through to no-budget/--fast instead of a usage error; (2) thin-client timeout
reported both sync+cycle stale even under `--section sync`, naming a section the
caller excluded (local path was already correct); (3) the section race abandoned
the remote promise locally but didn't cancel the in-flight MCP call — pass the
budget as timeoutMs so the losing side actually cancels. Regression test added.
…dge + sync/status/minion reliability (#2194 #2227 #1994 #1737 #1738 #1950 #1984)

Bundles the already-reviewed autopilot/supervisor stabilization (#2194 #2227
#1994: cycle split, per-source failure cooldown, fan-out clamp, degraded
supervisor retry, DB-lock live-supervisor detection) with four operational
fixes: minion timeout attempt-accounting (#1737), agent-run trailing-flag
parsing (#1738), honest live-sync sources status + progress-aware stall
watchdog (#1950), and status version + --deadline-ms partial result (#1984).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
garrytan added 2 commits June 18, 2026 11:06
Post-ship doc sync (/document-release): add the sync stall watchdog env var to
the CLAUDE.md sync-tuning table (Five → Six knobs) + regenerate the llms bundle.
…2194)

The cherry-picked autopilot-fanout-clamp + doctor-autopilot-fanout-concurrency
tests mutate process.env.GBRAIN_AUDIT_DIR in beforeEach/afterEach, which the
check:test-isolation R1 lint flags (parallel shards load multiple files per
process). Rename to *.serial.test.ts (sanctioned quarantine — they run under
--max-concurrency=1) instead of restructuring the reviewed test bodies. No logic
change; both files stay green (9 tests). Fixes the failing verify CI check.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

autopilot: multi-source Postgres fan-out produces continuous dead autopilot-cycle jobs + recurrent queue wedge

1 participant