v0.42.52.0 fix(reliability): autopilot dead-job storm + supervisor wedge + sync/status/minion reliability (#2194 #2227 #1994 #1737 #1738 #1950 #1984)#2287
Open
garrytan wants to merge 20 commits into
Conversation
#2227) A duplicate supervisor loses the queue-scoped DB singleton lock (#1849) and exits LOCK_HELD before spawning a worker or emitting 'started'. summarizeCrashes counts only worker_exited, so the fence path is structurally uncountable. Pin it so a future refactor that logs worker_exited on the fence path fails here instead of silently re-introducing the crash-budget breaker-trip loop. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…, not global repo (#2194 #2227) A per-source autopilot-cycle inherited the global sync.repo_path as brainDir while stamping DB freshness for source_id — mixed scope. FS phases (sync/lint/extract) ran against the wrong tree, so the failure-cooldown and freshness gates would attribute work to the wrong source. Resolve the source's local_path in the handler (reuse the archive-recheck SELECT) and bind brainDir to it; a pure-DB source gets null (FS phases skip) instead of falling through to the global checkout. Legacy no-source dispatch keeps the global repoPath. Prerequisite for the cooldown/split commits (codex outside-voice #8). Resolves TODOS:634. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… $HOME (#2227) jobs supervisor status + doctor read the HOME-derived pidfile, so a supervisor started under a different $HOME (keeper=/root vs ops=/data) read as 'not running' while healthy — the false signal that drives an operator to spawn a duplicate. Both surfaces now fall back to the queue-scoped DB singleton lock (#1849), the HOME-independent authority, when the pidfile shows nothing. New isLockHolderLive keys on lock freshness (ttl + heartbeat steal-grace), never process.kill, so PID reuse can't false-positive (pid-liveness-alone-pid-reuse). Status surfaces the holder host/pid + recorded concurrency/max-rss from the latest started event. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… storm (#1994 #2227) max_crashes_exceeded gave up forever, so a transient DB-pooler blip that tripped the soft budget wedged the queue until a human restart (#2227's breaker-trips tail). Crossing the soft budget now enters degraded mode: keep respawning with capped exponential backoff (60s cap — a paced retry, not a hot loop) and emit a loud crash_budget_degraded health_warn. The existing stable-run reset clears the count once a respawn survives >5min, so a recovered DB self-heals. Permanent give-up fires only at a much-higher hard ceiling (maxCrashes × 10), tunable/disablable via GBRAIN_SUPERVISOR_HARD_STOP_CRASHES (0 = never). Resolves TODOS:92. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…#2194) Fan-out resolved to 4 (Postgres) regardless of worker --concurrency, so surplus cycles queued behind the worker and raced the stalled-sweeper. Two fixes for the same mismatch: - resolveEffectiveFanoutMax clamps to max(1, concurrency-1) (reserve a slot), gated on a LIVE DB-lock holder so a stale started-audit row can't shrink throughput (codex #9/D5); no live holder → unknown → unclamped base. Escape hatch autopilot.fanout_clamp_to_concurrency. - doctor's autopilot_fanout_concurrency check warns when fan-out exceeds effective slots — the misconfig was silent before. Advisory (started-event concurrency), wired into both doctor surfaces. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rm (#2194) Only SUCCESS gated dispatch, so a source whose cycle kept failing/timing-out re-fanned-out every 5-min tick forever (200+ dead jobs/24h). Now a failed source backs off with bounded exponential cooldown (10→120min). Read at DISPATCH from minion_jobs dead/failed rows (timeouts/RSS-kills dead-letter via SQL and never run handler code, so a write-only hook would miss them) AND re-checked at CLAIM time in the handler (codex #5: already-queued/retrying jobs). A success clears it (codex #7); null-source rows excluded (codex #6); engine-parity via executeRaw. Disable with autopilot.failure_cooldown_min=0. Fail-open if config/history reads error. Surfaced via fanout_cooldown_skipped + the fanout summary. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ntenance job (#2194 #2227) N per-source cycles each ran the brain-wide global phases (embed-all/orphans/ purge/…) concurrently, thrashing the same rows and taking the worker 4→10GB in <60s → RSS-kill → orphaned stalls. Split them: per-source jobs now run only source-scoped (+ mixed) phases and stamp last_source_cycle_at; a new autopilot-global-maintenance job runs the global phases ONCE per window (idempotency_key + maxWaiting:1 = structural single-flight) and stamps autopilot.last_global_at. This is the codex-endorsed design that replaced the rejected skip-and-stamp-fresh approach (codex #1/#2): no freshness poisoning, no starvation — global work always runs as its own job, never marked done when it wasn't. PHASE_SCOPE is now a runtime partition (GLOBAL ∪ NON_GLOBAL == ALL). last_full_cycle_at still written for doctor/legacy (no longer a global gate). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
) Follow-up to the supervisor-visibility commit: doctor's engine binding is BrainEngine | null, so the inspectLock fallback must guard on a non-null engine (tsc TS2345). No behavior change — a null engine simply skips the DB-lock probe and falls back to the pidfile reading, as before. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
) Follow-up to the fan-out/concurrency commit: the doctor-categories drift guard requires every check name in doctor.ts to belong to exactly one category set. Add the new autopilot_fanout_concurrency check to OPS_CHECK_NAMES (infrastructure liveness, alongside wedged_queue/supervisor). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…raded-retry (#2194 #2227) Post-ship document-release: refresh the KEY_FILES current-state entries that drifted — cycle.ts (GLOBAL/NON_GLOBAL phase split + last_source_cycle_at / autopilot.last_global_at), jobs.ts (per-source local_path brainDir, claim-time cooldown, autopilot-global-maintenance handler), supervisor.ts + child-worker (degraded retry instead of permanent give-up; hard ceiling), db-lock.ts (isLockHolderLive), handler-timeouts (new handler). Regenerated llms bundle. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mpt (#1737) The per-job timeout_at dead-letter (handleTimeouts) set status='dead' without incrementing attempts_made, unlike the wall-clock and stall dead-letter siblings. It is the FIRST killer to fire for the long-lane handlers (subagent / embed-backfill / autopilot-cycle) because timeout_ms is stamped at submit, so a timed-out long job reported `attempts: 0/N (started: N)`. Mirror the siblings with attempts_made + 1 (terminal, no retry). Safe against double-count: the worker sweep runs handleStalled -> handleTimeouts -> handleWallClockTimeouts sequentially and awaited, each guarded on status='active', so the first to dead-letter excludes the row from the rest. Regression assertions added (test/minions.test.ts + e2e/minions-resilience.test.ts) so the increment can't be silently dropped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…freeform (#1738) parseRunFlags() broke flag parsing at the first positional token, so any flag after the prompt (`gbrain agent run "do X" --detach`) was swallowed into the prompt string and silently ignored. Now the no-value switches --detach/--follow/ --no-follow are hoisted when they trail the prompt, while everything else stays verbatim: an unknown --word is treated as prompt text (no "unknown flag" throw), a --switch mid-prompt is preserved, and `--` suppresses hoisting entirely for a literal escape. Value-flags now reject a missing or flag-shaped value (and --max-turns/--timeout-ms a non-number) instead of capturing undefined/NaN. Contract change: a prompt that starts with or trails an unguarded --word no longer errors; a literal trailing --detach needs `--`. Help text updated; tests revised + extended (test/agent-cli.test.ts). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Finishes the #2255 honest-freshness story for two gaps it left. (a) `gbrain sources status` printed "idle" while a sync proc held the per-source lock (the reported bug). New shared liveSyncStatus() helper in db-lock.ts reads the SAME live-lock signal `gbrain doctor` uses; runStatus now shows "running" (BACKFILL column + a sync_running field in --json) and suppresses the misleading "never synced" warning while a sync is live. One helper, so the surfaces can't drift (doctor/status retrofit tracked as a follow-up). (b) A sync wedged-but-alive kept refreshing its lock heartbeat (it fires on its own timer) and hadn't hit the wall-clock deadline, so only a manual pkill freed it. New in-band stall watchdog keys off FORWARD IMPORT PROGRESS (progress.tick), not the heartbeat: if no file completes for GBRAIN_SYNC_STALL_ABORT_SECONDS (default 900s), it aborts via a controller composed into opts.signal, so the drain returns partial() (last_commit unchanged, next run resumes from the checkpoint) and withRefreshingLock releases the lock. Limits, documented in code: a single file slower than the window trips it; a fully starved event loop won't fire the timer (the wall-clock hard deadline is that backstop). Tests: liveSyncStatus (live/expired/none/per-source) in db-lock-inspect; the resolveStallAbortSeconds env matrix in sync-hard-deadline. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`gbrain status` had no version in its JSON envelope and could hang on a slow connection with no way to get a partial answer. Two additions: - version: the StatusReport JSON now carries the local gbrain CLI version so a poller can pin behavior to a build. Thin-client also surfaces remote_version (the brain server's version), and the get_status_snapshot MCP op reports its version for that parity. - --deadline-ms=N / --fast: a shared wall-clock budget. Each section is raced against the REMAINING budget via Promise.race (NOT process-watchdog, which SIGKILLs and can't return partial output), so one slow/hung section can't strand the snapshot — it's marked stale and the rest still return. The envelope gains partial:true + stale_sections[]; exit code stays 0 (a snapshot was produced). Invalid --deadline-ms → exit 2. Tests: parseDeadlineFlag + withSectionDeadline (hermetic), the usage-error exit, version presence in the PGLite envelope, and the op's version key. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…#1950) Pre-landing review (codex + adversarial): the stall watchdog aborted opts.signal but the per-iteration abort checks returned partial('timeout'), collapsing a wedge-reap into a user --timeout/SIGINT so JSON consumers couldn't tell them apart. Add a 'stall_timeout' reason (set via a stallAborted flag) on the three import-loop abort sites; deletes/renames-phase and checkpoint sites stay 'timeout'. Sharpen the watchdog comment: the abort is observed BETWEEN files, so a hang inside a single importFile is not interrupted until it returns (TODO: thread a cancellation signal through importFile).
…1738) Pre-landing review: the leading-flag loop breaks at the first positional, so the `escaped` flag only fired for a leading `--`. A `--` placed after a positional left trailing-switch hoisting active, so `agent run note -- body --detach` silently detached and dropped the `--` as junk. Suppress hoisting whenever a literal `--` appears in the prompt. Regression test added.
… losing remote call (#1984) Pre-landing review (codex): (1) bare `--deadline-ms` with no value silently fell through to no-budget/--fast instead of a usage error; (2) thin-client timeout reported both sync+cycle stale even under `--section sync`, naming a section the caller excluded (local path was already correct); (3) the section race abandoned the remote promise locally but didn't cancel the in-flight MCP call — pass the budget as timeoutMs so the losing side actually cancels. Regression test added.
…dge + sync/status/minion reliability (#2194 #2227 #1994 #1737 #1738 #1950 #1984) Bundles the already-reviewed autopilot/supervisor stabilization (#2194 #2227 #1994: cycle split, per-source failure cooldown, fan-out clamp, degraded supervisor retry, DB-lock live-supervisor detection) with four operational fixes: minion timeout attempt-accounting (#1737), agent-run trailing-flag parsing (#1738), honest live-sync sources status + progress-aware stall watchdog (#1950), and status version + --deadline-ms partial result (#1984). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Post-ship doc sync (/document-release): add the sync stall watchdog env var to the CLAUDE.md sync-tuning table (Five → Six knobs) + regenerate the llms bundle.
…2194) The cherry-picked autopilot-fanout-clamp + doctor-autopilot-fanout-concurrency tests mutate process.env.GBRAIN_AUDIT_DIR in beforeEach/afterEach, which the check:test-isolation R1 lint flags (parallel shards load multiple files per process). Rename to *.serial.test.ts (sanctioned quarantine — they run under --max-concurrency=1) instead of restructuring the reviewed test bodies. No logic change; both files stay green (9 tests). Fixes the failing verify CI check.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
One reliability fix wave, scoped from the open
garrytan-agentsissue backlog. It lands the already-reviewed autopilot/supervisor stabilization (previously stranded as an unmerged PR) plus four bounded operational fixes found alongside.On a multi-source Postgres brain, autopilot could fan out a continuous stream of dead
autopilot-cyclejobs while the supervisor periodically wedged the very queue it exists to keep alive — one disease with several interacting parts. This addresses all of them, then cleans up four smaller reliability bugs.What's in the wave
Autopilot / supervisor (Closes #2194, #2227, #1994) — replayed from the reviewed work, commit-for-commit:
autopilot-global-maintenancejob runs brain-wide phases once per window — removes the per-cycle memory blow-up that was the shared root cause.$HOME-derived pidfile).gbrain doctormismatch check.Operational fixes:
handleTimeoutswas the one dead-letter path that didn't increment, so long-lane jobs readattempts: 0/N). (Attempt-accounting half; the queue-starvation half stays a measurement-gated v0.43+ follow-up — see TODOS.)gbrain agent runno longer swallows a trailing--detach/--followinto the prompt; a--wordinside the prompt stays verbatim;--ends flag parsing anywhere.gbrain sources statusreports a live sync as running (not "idle") via a sharedliveSyncStatushelper; a progress-aware stall watchdog aborts a wedged drain and releases its lock so the next sync resumes from the checkpoint (no manualpkill). (In-flight single-file abort is a documented follow-up in TODOS.)gbrain statusgains aversionfield and a--deadline-ms/--fastper-section budget that returns a partial envelope instead of hanging a poller.Pre-landing review
/plan-eng-review(3 deep-review agents + codex outside voice) hardened the design; 4 decisions resolved with the author.stall_timeout),--deadline-msmissing-value is a usage error, thin-client stale-section reporting scoped to requested sections + losing remote call cancelled, and the--escape now suppresses hoisting after a positional. Each fix has a regression test.Test plan
test/cli.test.ts18/18).bun run typecheckclean. Targeted suites green: v0.42.50.0 fix(autopilot): kill the dead-job storm + supervisor queue wedge (#2194 #2227 #1994) #2249 (108),agent-cli+status-sections(52),cli(18), minionshandleTimeouts.🤖 Generated with Claude Code