v0.42.50.0 fix(autopilot): kill the dead-job storm + supervisor queue wedge (#2194 #2227 #1994) #2249
Closed
garrytan wants to merge 13 commits into
#2227) A duplicate supervisor loses the queue-scoped DB singleton lock (#1849) and exits LOCK_HELD before spawning a worker or emitting 'started'. summarizeCrashes counts only worker_exited, so the fence path is structurally uncountable. Pin it so a future refactor that logs worker_exited on the fence path fails here instead of silently re-introducing the crash-budget breaker-trip loop. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
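The invariant this test pins can be sketched as follows. This is a minimal illustration, not gbrain's actual summarizeCrashes: the event shape (`SupervisorEvent`), the `lock_held_exit` event name, and the `countCrashes` helper are assumptions; only `worker_exited` and `started` are named in the commit message.

```typescript
// Hypothetical audit-event shape; field and type names are assumptions for illustration.
type SupervisorEvent =
  | { type: "started"; at: number }
  | { type: "worker_exited"; at: number; code: number }
  | { type: "lock_held_exit"; at: number }; // fenced-out duplicate supervisor (name assumed)

// Count crashes the way the commit describes: only worker_exited events qualify.
function countCrashes(events: SupervisorEvent[]): number {
  return events.filter((e) => e.type === "worker_exited").length;
}

// A fenced duplicate exits before spawning a worker, so its log contributes no crashes.
const fencedDuplicateLog: SupervisorEvent[] = [{ type: "lock_held_exit", at: 1 }];
const crashedWorkerLog: SupervisorEvent[] = [
  { type: "started", at: 1 },
  { type: "worker_exited", at: 2, code: 1 },
];
```

If a later refactor logged `worker_exited` on the fence path, the first log would start counting toward the crash budget, which is the regression the pinned test is meant to catch.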
…, not global repo (#2194 #2227) A per-source autopilot-cycle inherited the global sync.repo_path as brainDir while stamping DB freshness for source_id — mixed scope. FS phases (sync/lint/extract) ran against the wrong tree, so the failure-cooldown and freshness gates would attribute work to the wrong source. Resolve the source's local_path in the handler (reuse the archive-recheck SELECT) and bind brainDir to it; a pure-DB source gets null (FS phases skip) instead of falling through to the global checkout. Legacy no-source dispatch keeps the global repoPath. Prerequisite for the cooldown/split commits (codex outside-voice #8). Resolves TODOS:634. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
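The three cases in this commit (legacy dispatch, source with a checkout, pure-DB source) can be sketched as a small resolver. The `SourceRow` shape and the `resolveBrainDir` name are hypothetical; only `source_id`, `local_path`, and the global repo path come from the commit message.

```typescript
// Hypothetical row shape; only source_id and local_path are named in the commit message.
interface SourceRow {
  source_id: string;
  local_path: string | null; // null for a pure-DB source
}

// Returns the directory the FS phases should run against, or null to skip them.
function resolveBrainDir(source: SourceRow | undefined, globalRepoPath: string): string | null {
  if (source === undefined) return globalRepoPath; // legacy no-source dispatch keeps the global path
  return source.local_path; // may be null: FS phases skip, with no fall-through to the global checkout
}
```

The point of returning null rather than the global path is that a source without a checkout never has filesystem work attributed to it.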
… $HOME (#2227) jobs supervisor status + doctor read the HOME-derived pidfile, so a supervisor started under a different $HOME (keeper=/root vs ops=/data) read as 'not running' while healthy — the false signal that drives an operator to spawn a duplicate. Both surfaces now fall back to the queue-scoped DB singleton lock (#1849), the HOME-independent authority, when the pidfile shows nothing. New isLockHolderLive keys on lock freshness (ttl + heartbeat steal-grace), never process.kill, so PID reuse can't false-positive (pid-liveness-alone-pid-reuse). Status surfaces the holder host/pid + recorded concurrency/max-rss from the latest started event. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
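A freshness-based liveness check of the kind described here might look like the sketch below. It is not the real isLockHolderLive: the `LockRow` fields, the `isLockFresh` name, and the way ttl and steal-grace are combined are assumptions made for illustration.

```typescript
// Hypothetical lock-row shape; field names are assumptions, not gbrain's schema.
interface LockRow {
  holderHost: string;
  holderPid: number;
  lastHeartbeatMs: number; // epoch millis of the holder's latest heartbeat
  ttlMs: number; // lease length granted to the holder
}

// Liveness from lease freshness only. No process.kill(pid, 0) probe is used,
// so a reused PID cannot make a dead supervisor look alive.
function isLockFresh(lock: LockRow | null, nowMs: number, stealGraceMs: number): boolean {
  if (lock === null) return false;
  return nowMs - lock.lastHeartbeatMs <= lock.ttlMs + stealGraceMs;
}
```

Because the check reads only the DB row, it gives the same answer regardless of which $HOME the caller runs under.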
… storm (#1994 #2227) max_crashes_exceeded gave up forever, so a transient DB-pooler blip that tripped the soft budget wedged the queue until a human restart (#2227's breaker-trips tail). Crossing the soft budget now enters degraded mode: keep respawning with capped exponential backoff (60s cap — a paced retry, not a hot loop) and emit a loud crash_budget_degraded health_warn. The existing stable-run reset clears the count once a respawn survives >5min, so a recovered DB self-heals. Permanent give-up fires only at a much-higher hard ceiling (maxCrashes × 10), tunable/disablable via GBRAIN_SUPERVISOR_HARD_STOP_CRASHES (0 = never). Resolves TODOS:92. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
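The policy in this commit can be sketched as a pure decision function. The names (`decideRespawn`, `resolveHardCeiling`), the 1s base delay, the treatment of malformed values, and the exact point at which the soft budget counts as crossed are assumptions; the 60s cap, the maxCrashes × 10 default ceiling, and the "0 = never" semantics come from the commit message.

```typescript
// Hypothetical sketch; names and the 1s base delay are assumptions, not gbrain's code.
type RespawnDecision =
  | { action: "respawn"; delayMs: number; degraded: boolean }
  | { action: "give_up" };

const BASE_BACKOFF_MS = 1_000; // assumed base delay
const BACKOFF_CAP_MS = 60_000; // 60s cap, per the commit message

// raw stands in for GBRAIN_SUPERVISOR_HARD_STOP_CRASHES: unset -> maxCrashes * 10, "0" -> never.
function resolveHardCeiling(maxCrashes: number, raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return maxCrashes * 10;
  const n = Number(raw);
  if (!Number.isInteger(n) || n < 0) return maxCrashes * 10; // malformed value: use default (assumption)
  return n === 0 ? Infinity : n;
}

function decideRespawn(crashCount: number, maxCrashes: number, hardStopRaw?: string): RespawnDecision {
  if (crashCount >= resolveHardCeiling(maxCrashes, hardStopRaw)) return { action: "give_up" };
  // Exponential growth, with the exponent bounded so the multiplication cannot overflow.
  const delayMs = Math.min(BACKOFF_CAP_MS, BASE_BACKOFF_MS * 2 ** Math.min(crashCount, 30));
  // At or past the soft budget the supervisor keeps respawning, but flags itself as degraded.
  return { action: "respawn", delayMs, degraded: crashCount >= maxCrashes };
}
```

The stable-run reset described above would sit outside this function: once a respawn survives long enough, the caller sets the crash count back to zero and the delay returns to the base value.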
…#2194) Fan-out resolved to 4 (Postgres) regardless of worker --concurrency, so surplus cycles queued behind the worker and raced the stalled-sweeper. Two fixes for the same mismatch:
- resolveEffectiveFanoutMax clamps to max(1, concurrency-1) (reserve a slot), gated on a LIVE DB-lock holder so a stale started-audit row can't shrink throughput (codex #9/D5); no live holder → unknown → unclamped base. Escape hatch: autopilot.fanout_clamp_to_concurrency.
- doctor's autopilot_fanout_concurrency check warns when fan-out exceeds effective slots; the misconfig was silent before. Advisory (started-event concurrency), wired into both doctor surfaces.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
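One way to read the clamp is as a minimum of the base fan-out and the reserved-slot bound. The sketch below follows that reading; it is not the real resolveEffectiveFanoutMax, and the `FanoutInputs` shape and `effectiveFanoutMax` name are hypothetical.

```typescript
// Hypothetical inputs; field names are assumptions for illustration.
interface FanoutInputs {
  baseFanoutMax: number; // e.g. 4 on Postgres, per the commit message
  liveWorkerConcurrency: number | null; // null when no live DB-lock holder is known
  clampEnabled: boolean; // stands in for autopilot.fanout_clamp_to_concurrency
}

function effectiveFanoutMax(i: FanoutInputs): number {
  // No live holder means concurrency is unknown, so the base value is left unclamped.
  if (!i.clampEnabled || i.liveWorkerConcurrency === null) return i.baseFanoutMax;
  // Reserve one worker slot, but never drop below a fan-out of 1.
  const slotBound = Math.max(1, i.liveWorkerConcurrency - 1);
  return Math.min(i.baseFanoutMax, slotBound);
}
```

Under this reading, a worker started with --concurrency 2 would fan out one cycle per tick, while a worker with concurrency 8 would keep the base of 4.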
…rm (#2194) Only SUCCESS gated dispatch, so a source whose cycle kept failing/timing-out re-fanned-out every 5-min tick forever (200+ dead jobs/24h). Now a failed source backs off with bounded exponential cooldown (10→120min). Read at DISPATCH from minion_jobs dead/failed rows (timeouts/RSS-kills dead-letter via SQL and never run handler code, so a write-only hook would miss them) AND re-checked at CLAIM time in the handler (codex #5: already-queued/retrying jobs). A success clears it (codex #7); null-source rows excluded (codex #6); engine-parity via executeRaw. Disable with autopilot.failure_cooldown_min=0. Fail-open if config/history reads error. Surfaced via fanout_cooldown_skipped + the fanout summary. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
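A bounded exponential cooldown with the 10→120min range could be computed as below. The doubling factor, the function names, and the use of a consecutive-failure count are assumptions; the commit message gives only the bounds, the "0 disables" rule, and the fact that a success clears the cooldown.

```typescript
// Cooldown in minutes after `consecutiveFailures` failed cycles for one source.
// baseMin stands in for autopilot.failure_cooldown_min, capMin for
// autopilot.failure_cooldown_cap_min; doubling per failure is an assumption.
function cooldownMinutes(consecutiveFailures: number, baseMin = 10, capMin = 120): number {
  if (baseMin <= 0 || consecutiveFailures <= 0) return 0; // 0 disables; a success resets the count
  const doubled = baseMin * 2 ** Math.min(consecutiveFailures - 1, 30);
  return Math.min(capMin, doubled);
}

// True while a source should be skipped. The same predicate can be evaluated
// at dispatch time and again at claim time.
function isCoolingDown(lastFailureMs: number, consecutiveFailures: number, nowMs: number): boolean {
  const windowMs = cooldownMinutes(consecutiveFailures) * 60_000;
  return nowMs - lastFailureMs < windowMs;
}
```

With these defaults the sequence of cooldowns is 10, 20, 40, 80, then 120 minutes for every further failure.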
…ntenance job (#2194 #2227) N per-source cycles each ran the brain-wide global phases (embed-all/orphans/purge/…) concurrently, thrashing the same rows and taking the worker 4→10GB in <60s → RSS-kill → orphaned stalls. Split them: per-source jobs now run only source-scoped (+ mixed) phases and stamp last_source_cycle_at; a new autopilot-global-maintenance job runs the global phases ONCE per window (idempotency_key + maxWaiting:1 = structural single-flight) and stamps autopilot.last_global_at. This is the codex-endorsed design that replaced the rejected skip-and-stamp-fresh approach (codex #1/#2): no freshness poisoning, no starvation; global work always runs as its own job and is never marked done when it wasn't. PHASE_SCOPE is now a runtime partition (GLOBAL ∪ NON_GLOBAL == ALL). last_full_cycle_at is still written for doctor/legacy (no longer a global gate). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
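The partition invariant and the once-per-window key can be sketched as follows. The phase table is illustrative and incomplete, the scopes assigned to sync/lint/extract are a guess, and `isPartition`, `globalMaintenanceKey`, and the key format are hypothetical; the real PHASE_SCOPE and idempotency_key may differ.

```typescript
type Scope = "source" | "mixed" | "global";

// Illustrative, incomplete phase table. The commit messages name these phases,
// but the scope assigned to each here is an assumption.
const PHASE_SCOPE: Record<string, Scope> = {
  sync: "source",
  lint: "source",
  extract: "source",
  "embed-all": "global",
  orphans: "global",
  purge: "global",
};

const ALL_PHASES = Object.keys(PHASE_SCOPE);
const GLOBAL_PHASES = ALL_PHASES.filter((p) => PHASE_SCOPE[p] === "global");
const NON_GLOBAL_PHASES = ALL_PHASES.filter((p) => PHASE_SCOPE[p] !== "global");

// Checks GLOBAL ∪ NON_GLOBAL == ALL and that the two sets do not overlap.
function isPartition(all: string[], a: string[], b: string[]): boolean {
  const union = new Set([...a, ...b]);
  const disjoint = a.every((p) => !b.includes(p));
  return disjoint && union.size === all.length && all.every((p) => union.has(p));
}

// One key per time window, so repeated enqueues inside a window dedupe to one job.
function globalMaintenanceKey(nowMs: number, windowMin: number): string {
  return `autopilot-global-maintenance:${Math.floor(nowMs / (windowMin * 60_000))}`;
}
```

A queue that rejects a second job with the same key then gives single-flight by construction: no cycle has to decide at runtime whether to skip the global phases.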
) Follow-up to the supervisor-visibility commit: doctor's engine binding is BrainEngine | null, so the inspectLock fallback must guard on a non-null engine (tsc TS2345). No behavior change — a null engine simply skips the DB-lock probe and falls back to the pidfile reading, as before. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
) Follow-up to the fan-out/concurrency commit: the doctor-categories drift guard requires every check name in doctor.ts to belong to exactly one category set. Add the new autopilot_fanout_concurrency check to OPS_CHECK_NAMES (infrastructure liveness, alongside wedged_queue/supervisor). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… wedge (#2194 #2227 #1994) Split the per-source cycle (source phases fan out; one global-maintenance job runs the brain-wide phases once) to end the 4→10GB RSS blowout; add per-source failure cooldown, fan-out-to-concurrency clamp + doctor warning, supervisor degraded-retry instead of permanent give-up, and DB-lock supervisor visibility under split $HOME. Resolves TODOS:634 (repoPath) and TODOS:92 (#1994 backoff). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…raded-retry (#2194 #2227) Post-ship document-release: refresh the KEY_FILES current-state entries that drifted — cycle.ts (GLOBAL/NON_GLOBAL phase split + last_source_cycle_at / autopilot.last_global_at), jobs.ts (per-source local_path brainDir, claim-time cooldown, autopilot-global-maintenance handler), supervisor.ts + child-worker (degraded retry instead of permanent give-up; hard ceiling), db-lock.ts (isLockHolderLive), handler-timeouts (new handler). Regenerated llms bundle. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-wedge-fix # Conflicts: # CHANGELOG.md # VERSION # package.json
Owner
Author
Superseded by #2287. This branch's 8 functional commits + the KEY_FILES doc update were replayed commit-for-commit onto current master in #2287 (re-versioned v0.42.50.0 → v0.42.52.0, since 0.42.50.0/0.42.51.0 were taken by #2254/#2255), bundled with four adjacent reliability fixes (#1737 #1738 #1950 #1984). #2287 closes #2194/#2227/#1994. Closing this in favor of #2287.
garrytan added a commit that referenced this pull request Jun 18, 2026
…2194) The cherry-picked autopilot-fanout-clamp + doctor-autopilot-fanout-concurrency tests mutate process.env.GBRAIN_AUDIT_DIR in beforeEach/afterEach, which the check:test-isolation R1 lint flags (parallel shards load multiple files per process). Rename to *.serial.test.ts (sanctioned quarantine — they run under --max-concurrency=1) instead of restructuring the reviewed test bodies. No logic change; both files stay green (9 tests). Fixes the failing verify CI check.
Summary
Fixes the recurring failure mode on multi-source Postgres brains where autopilot manufactured a continuous stream of dead autopilot-cycle jobs and the minion supervisor periodically wedged the queue it is meant to keep alive. Root cause was one disease with several interacting parts; this wave addresses all of them. Closes gbrain#2194, gbrain#2227, gbrain#1994.
What changed (8 atomic commits)
- test: pin that the duplicate-supervisor LOCK_HELD fence-exit is never counted as a crash (Recurring queue wedge: dueling supervisors under split $HOME + per-cycle RSS blowout trip crash-budget breaker on the healthy supervisor #2227).
- fix (prerequisite): per-source cycles bind filesystem phases to source.local_path, not the global repo path, so FS/DB phases and the freshness stamp agree on the source (autopilot: multi-source Postgres fan-out produces continuous dead autopilot-cycle jobs + recurrent queue wedge #2194 / #2227, TODOS:634).
- fix: gbrain jobs supervisor status / gbrain doctor detect a live supervisor via the queue-scoped DB lock when the $HOME-derived pidfile is absent. Keyed on lock freshness, never a bare PID probe (PID-reuse-safe) (#2227).
- fix: supervisor degrades to capped-backoff retry instead of permanently giving up at the soft crash budget; self-heals on a transient DB blip; hard ceiling via GBRAIN_SUPERVISOR_HARD_STOP_CRASHES (supervisor: max_crashes_exceeded gives up permanently on transient DB outages - needs retry backoff instead of hard stop #1994 / #2227, TODOS:92).
- feat: clamp per-tick fan-out to max(1, concurrency − 1) (gated on a live supervisor) + a gbrain doctor warning on mismatch (#2194).
- feat: per-source failure cooldown (bounded exponential 10→120min), checked at dispatch AND claim time; a success clears it (#2194).
- feat: split the cycle. Per-source jobs run only source-scoped phases; a single autopilot-global-maintenance job runs the brain-wide phases once per window (structural single-flight). This kills the 4→10GB RSS blowout that was the shared root cause of #2194 and #2227.
- fix: categorize the new doctor check + a nullable-engine guard.
Why the split (the load-bearing decision)
The plan first proposed single-flight-skip (concurrent cycles skip the global phases if another holds a lock). A codex outside-voice review during /plan-eng-review caught that skip-then-stamp-fresh poisons the only freshness gate: a source gets marked done while its new rows go unembedded. The split design (per-source phases + one global-maintenance job) is the codebase's documented Phase-2 design: no skip semantics, no freshness poisoning, single-flight by construction.
New config / surface
- Job: autopilot-global-maintenance
- Config keys: autopilot.failure_cooldown_min (0=off), autopilot.failure_cooldown_cap_min, autopilot.fanout_clamp_to_concurrency, autopilot.global_floor_min
- Environment variable: GBRAIN_SUPERVISOR_HARD_STOP_CRASHES
- Doctor check: autopilot_fanout_concurrency
- Freshness stamps: last_source_cycle_at; brain config: autopilot.last_global_at
All on by default; existing brains pick them up on the next tick with no migration (one catch-up global pass on first tick).
Test plan
- bun run typecheck: clean.
- test/e2e/autopilot-cooldown-parity.test.ts (PGLite half local; Postgres half runs in CI).
- /plan-eng-review + codex outside-voice (logged CLEAR).
🤖 Generated with Claude Code