fix(autopilot): resilience + hot-memory hardening (per-phase timeout, dispatch probe, op-checkpoint, facts visibility)#2268

Open
JiraiyaETH wants to merge 7 commits into garrytan:master from JiraiyaETH:jarvis/autopilot-resilience-20260617

Conversation

@JiraiyaETH

Autopilot resilience and hot-memory hardening: 6 commits plus a test fix, all based on origin/master.

Commits

  • minions: env-resolvable + clamped DB-probe self-defense thresholds.
  • frontmatter: derive the validate slug from the brain root (git rev-parse) so it's stable regardless of CWD.
  • cycle: per-phase soft timeout + an autopilot git-commit phase — a slow phase aborts at its next batch boundary, self-truncates (status:'skipped', reason:'phase_timeout'), and the cycle continues instead of dying at the job wall-clock. Partial work is durable (watermarks) and resumes next cycle.
  • autopilot: bound the dispatch health-probe + enable the per-phase timeout on cycles.
  • facts: facts.default_visibility knob (configurable hot-memory default).
  • op-checkpoint: bind completed_keys via to_jsonb(text[]) instead of JSON.stringify + $3::jsonb. The latter double-encodes the value into a jsonb string scalar, which violates the op_checkpoints_completed_keys_array CHECK. (Follow-up to #2255, "v0.42.51.0 fix(sync): contention-free clock + checkpoint integrity + honest sync freshness".)
  • test(cycle): pass a real AbortSignal to the autopilot-cycle handler tests. runPhaseWithTimeout registers an abort listener via parentSignal.addEventListener; the tests passed a fake { aborted: false } as any mock with no addEventListener, so they threw once the timeout path engaged. The real worker passes a genuine AbortSignal (production cycles run fine). No production code change.

Tests

Full suite green in a clean environment. (Two tests are environment-sensitive only when run against a live machine: resolveBoostMap reads GBRAIN_SOURCE_BOOST from the ambient shell, and worker-registry.serial contends with a running autopilot writing under ~/.gbrain — both pass in isolated CI.)

🤖 Generated with Claude Code

Jarvis and others added 7 commits June 18, 2026 07:10
GBRAIN_DB_FAIL_EXIT_AFTER (clamped [1,10]) and GBRAIN_DB_PROBE_TIMEOUT_MS
(clamped [10000,120000]) make the worker's hardcoded 3-probe/10s liveness
self-exit tunable for a high-latency pooler, where a transient wedge otherwise
self-exits a healthy worker mid-cycle. Reuses parsePositiveInt; explicit opts
still win. The upper clamp stops a bad config from masking a dead DB.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
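
The precedence described in this commit (explicit option, then environment variable, then default, with the environment value clamped) might look roughly like the sketch below. `resolveClampedInt` and its signature are hypothetical and are not the PR's `parsePositiveInt`. Whether explicit options are also clamped is not stated in the commit message, so the sketch assumes they pass through untouched.

```typescript
// Hypothetical helper, not the PR's code: explicit option > env var > default.
type Env = Record<string, string | undefined>;

function resolveClampedInt(
  explicit: number | undefined,
  env: Env,
  name: string,
  fallback: number,
  min: number,
  max: number,
): number {
  // Assumption: an explicit option wins outright and is not clamped.
  if (explicit !== undefined) return explicit;
  const raw = env[name];
  const parsed = raw === undefined ? NaN : Number.parseInt(raw, 10);
  // Unset, non-numeric, or non-positive values fall back to the default.
  if (!Number.isFinite(parsed) || parsed <= 0) return fallback;
  // The upper bound keeps a bad config from masking a dead DB.
  return Math.min(max, Math.max(min, parsed));
}

// Bounds and defaults taken from the commit message (3 probes, 10s).
const failAfter = resolveClampedInt(
  undefined, { GBRAIN_DB_FAIL_EXIT_AFTER: "50" }, "GBRAIN_DB_FAIL_EXIT_AFTER", 3, 1, 10);
const probeTimeoutMs = resolveClampedInt(
  undefined, { GBRAIN_DB_PROBE_TIMEOUT_MS: "500" }, "GBRAIN_DB_PROBE_TIMEOUT_MS", 10000, 10000, 120000);
```

With these inputs the out-of-range values are pulled back to the nearest bound rather than rejected.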
…arse)

`gbrain frontmatter validate <file>` derived the path slug from the ABSOLUTE
path in single-file mode (relative(target,file)='' -> abs-path fallback),
spuriously failing SLUG_MISMATCH for every explicit-slug file -- the pre-commit
hook's exact usage. Now uses `git rev-parse --show-toplevel` (handles worktrees,
submodules, symlinks) with the manual walk-up as fallback. Also fixes subdir mode.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
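
A minimal sketch of deriving a slug from the repository root rather than from the working directory. `gitTopLevel` and `slugFromRoot` are hypothetical names, and the slug rule (root-relative path, forward slashes, `.md` stripped) is an assumption about what the validator compares against. The manual walk-up fallback from the commit is left out.

```typescript
import { execFileSync } from "node:child_process";
import * as path from "node:path";

// Ask git for the repository root; returns null when git fails
// (not a checkout, missing directory, git not installed).
function gitTopLevel(dir: string): string | null {
  try {
    return execFileSync("git", ["rev-parse", "--show-toplevel"], {
      cwd: dir,
      encoding: "utf8",
      stdio: ["ignore", "pipe", "ignore"],
    }).trim();
  } catch {
    return null;
  }
}

// Hypothetical slug rule: path relative to the root, "/" separators, no ".md".
function slugFromRoot(root: string, file: string): string {
  const rel = path.relative(root, path.resolve(file));
  return rel.split(path.sep).join("/").replace(/\.md$/, "");
}
```

Because the slug is computed against the root that git reports, it does not change when the same file is validated from a subdirectory or passed as a single explicit path.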
Per-phase timeout (GBRAIN_PHASE_TIMEOUT_SECONDS / CycleOpts.phaseTimeoutSeconds):
heavy phases (resolve_symbol_edges, embed) abort at a batch boundary on the
deadline, recorded skipped:phase_timeout with partial counts preserved, and the
cycle CONTINUES instead of dying at the 1800s job timeout. Watermark-safe resume.
resolve_symbol_edges made abort-aware (mirrors embed's garrytan#1737). The relabel only
fires for a genuine abort -- a real phase error is never swallowed as a timeout.
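
The batch-boundary abort described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's `runPhaseWithTimeout`: the signature, the `PhaseResult` shape, and the batch representation are hypothetical, and for brevity the sketch folds a parent-signal abort into the same skipped path as a deadline.

```typescript
type PhaseResult =
  | { status: "ok"; processed: number }
  | { status: "skipped"; reason: "phase_timeout"; processed: number };

// Hypothetical sketch: run batches until done, or stop at the next batch
// boundary once the deadline passes or the parent signal aborts.
async function runPhaseWithTimeout(
  batches: Array<() => Promise<number>>,
  timeoutMs: number,
  parentSignal: AbortSignal,
): Promise<PhaseResult> {
  const controller = new AbortController();
  const onParentAbort = () => controller.abort();
  parentSignal.addEventListener("abort", onParentAbort, { once: true });
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  let processed = 0;
  try {
    for (const batch of batches) {
      // Checked only between batches, so a batch is never interrupted mid-write.
      if (controller.signal.aborted) {
        return { status: "skipped", reason: "phase_timeout", processed };
      }
      // A thrown error propagates as an error; it is never relabeled as a timeout.
      processed += await batch();
    }
    return { status: "ok", processed };
  } finally {
    clearTimeout(timer);
    parentSignal.removeEventListener("abort", onParentAbort);
  }
}
```

The partial count survives in the skipped result, which is what lets a later cycle resume from the recorded watermark instead of starting over.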

Commit phase (cycle.commit, runs last; PHASE_SCOPE 'global' -> commits the repo
once per cycle, not per-source): stage + timestamp-commit + push the brain tree,
reusing endorse.ts's execFileSync git pattern. All git calls bounded (30s timeout,
GIT_TERMINAL_PROMPT=0) so a hung/credential-prompting git can't block the event
loop under the cycle lock. Non-git dir skipped, clean tree no-op, hook-reject ->
fail, push failure non-fatal + credential-redacted. Closes the gap where autopilot
never committed brain pages (sync only reads already-committed content).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
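
A rough sketch of a bounded git invocation in the spirit of this commit, using Node's `execFileSync` with a `timeout` and `GIT_TERMINAL_PROMPT=0`. The helper names `gitEnv`, `runGit`, and `redactCredentials` are hypothetical, and the redaction regex is an assumption about what "credential-redacted" covers (userinfo embedded in an http(s) URL).

```typescript
import { execFileSync } from "node:child_process";

// Environment for git that fails fast instead of prompting for credentials.
function gitEnv(base: Record<string, string | undefined>): Record<string, string | undefined> {
  return { ...base, GIT_TERMINAL_PROMPT: "0" };
}

// Run git synchronously, killing the child if it exceeds the timeout (30s by default).
function runGit(cwd: string, args: string[], timeoutMs = 30000): string {
  return execFileSync("git", args, {
    cwd,
    timeout: timeoutMs,
    encoding: "utf8",
    env: gitEnv(process.env),
    stdio: ["ignore", "pipe", "pipe"],
  });
}

// Hypothetical redaction: hide userinfo in URLs before a push error is logged.
function redactCredentials(text: string): string {
  return text.replace(/(https?:\/\/)[^/\s@]+@/g, "$1***@");
}
```

A caller would typically wrap the push in `try/catch`, log `redactCredentials(String(err))`, and carry on, which matches the non-fatal push behavior the commit describes.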
…imeout on cycles

The autopilot main loop awaited engine.getConfig('version') unbounded; a half-open
pooler connection that never settles parked the loop forever (the alive-but-idle
wedge: process up, logs frozen, 0 cycles enqueued) -- uncaught by the worker
DB-probe and the MinionSupervisor watchdog (autopilot uses ChildWorkerSupervisor).
Bound the probe (GBRAIN_AUTOPILOT_PROBE_TIMEOUT_MS, default 10s) with the worker's
probeWithTimeout idiom so a hang throws into the existing reconnect/exit machinery
(launchd relaunch). The autopilot-cycle handler now passes phaseTimeoutSeconds
(default 600s, env-overridable) so the per-phase timeout is active on the autopilot
path; other runCycle callers (dream, manual) leave it off.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
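
Bounding an await that may never settle is commonly done by racing it against a timer. The sketch below shows that general idiom; it is an assumption about the shape of `probeWithTimeout`, not a copy of the worker's implementation.

```typescript
// Hypothetical re-creation of the idiom: reject if the probe has not settled in time.
async function probeWithTimeout<T>(probe: () => Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`probe timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
  try {
    // Whichever settles first wins; a hung probe loses to the timer.
    return await Promise.race([probe(), timeout]);
  } finally {
    // Always clear the timer so a fast probe does not leave it pending.
    clearTimeout(timer);
  }
}
```

The rejection turns a silent hang into an ordinary thrown error, so whatever reconnect or exit handling already wraps the probe gets a chance to run.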
…efault)

New facts were hardcoded to visibility='private' (backstop.ts), so autopilot/put_page/
sync-extracted facts never surfaced in remote 'recall' (world-only). Add a
facts.default_visibility config knob (default 'private', safe) honored at the
backstop write chokepoint; an explicit ctx.visibility (the extract_facts MCP op)
still wins. `gbrain config set facts.default_visibility world` then makes new
facts default 'world' for hot memory; the fence-write serializes it so the cycle
extract_facts reconcile preserves it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
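
The precedence at the write chokepoint (explicit `ctx.visibility`, then the `facts.default_visibility` config value, then 'private') could be expressed as below. `resolveFactVisibility` is a hypothetical name, and the sketch assumes 'private' and 'world' are the only valid values, since those are the only two the commit mentions.

```typescript
type Visibility = "private" | "world";

// Hypothetical resolver for the precedence described in the commit message.
function resolveFactVisibility(
  explicit: Visibility | undefined,
  configured: string | undefined,
): Visibility {
  // An explicit visibility from the caller always wins.
  if (explicit !== undefined) return explicit;
  // Otherwise honor the config knob when it holds a recognized value.
  if (configured === "world" || configured === "private") return configured;
  // Unset or unrecognized config falls back to the safe default.
  return "private";
}
```

Falling back to 'private' for unrecognized values keeps a typo in the config from widening visibility by accident.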
…N.stringify+$3::jsonb

recordCompleted bound `JSON.stringify(sorted)` to a positional `$3::jsonb` cast.
postgres.js DOUBLE-ENCODES that (the JSON string is re-stringified into a jsonb
STRING scalar, not an array) — the CLAUDE.md / postgres-engine.ts double-encode
trap. v0.42.51.0 (garrytan#2255) added the `op_checkpoints_completed_keys_array` CHECK
(jsonb_typeof = 'array') without fixing this writer, so every non-empty
recordCompleted write (e.g. sync's sync-target pin) now violates the constraint
and aborts the sync ("checkpoint target write failed"). Bind the keys as a
text[] and convert with to_jsonb($3::text[]) — mirrors appendCompleted's
unnest($3::text[]) idiom in the same file; verified to yield a true jsonb array.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
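
The double-encode effect can be reproduced without a database by modeling the driver's serialization step with a second `JSON.stringify`. This is only an analogy for what the commit describes; `jsonTypeOf` is a hypothetical stand-in for Postgres's `jsonb_typeof`.

```typescript
// Sorted keys, as recordCompleted is described as binding them.
const keys = ["b", "a"].sort();

// A pre-stringified value that gets serialized again ends up as a JSON *string*.
const doubleEncoded = JSON.stringify(JSON.stringify(keys));
// A value serialized exactly once is a JSON *array*.
const singleEncoded = JSON.stringify(keys);

// Rough analogue of jsonb_typeof for this illustration.
function jsonTypeOf(text: string): string {
  const v: unknown = JSON.parse(text);
  if (Array.isArray(v)) return "array";
  return v === null ? "null" : typeof v;
}

console.log(jsonTypeOf(doubleEncoded)); // prints "string"
console.log(jsonTypeOf(singleEncoded)); // prints "array"
```

A CHECK that requires `jsonb_typeof(...) = 'array'` accepts only the second form, which is why binding a `text[]` and converting server-side with `to_jsonb` sidesteps the problem.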
…ests

The per-phase-timeout work (runPhaseWithTimeout) registers an 'abort'
listener on the parent signal via addEventListener. The handler tests were
passing a fake `{ aborted: false } as any` mock that has no addEventListener,
so those 4 tests threw TypeError once the timeout path engaged. The real
worker passes a genuine AbortSignal (production cycles run fine). Swap the
mocks for `new AbortController().signal`. No production code change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
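
The difference between the stub and a real signal is easy to see in isolation. `canListen` is a hypothetical helper used only for this illustration.

```typescript
// A stub with only `aborted` passes type-checking via `as any`
// but is not an EventTarget.
const fakeSignal = { aborted: false } as any;
// A real signal comes from an AbortController and supports listeners.
const realSignal = new AbortController().signal;

function canListen(signal: unknown): boolean {
  return typeof (signal as { addEventListener?: unknown })?.addEventListener === "function";
}

console.log(canListen(fakeSignal)); // prints false
console.log(canListen(realSignal)); // prints true
```

Any code path that calls `signal.addEventListener("abort", ...)` throws a TypeError on the stub, which is the failure the four handler tests hit.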