fix(autopilot): resilience + hot-memory hardening (per-phase timeout, dispatch probe, op-checkpoint, facts visibility) #2268
Open
JiraiyaETH wants to merge 7 commits into
GBRAIN_DB_FAIL_EXIT_AFTER (clamped [1,10]) and GBRAIN_DB_PROBE_TIMEOUT_MS (clamped [10000,120000]) make the worker's hardcoded 3-probe/10s liveness self-exit tunable for a high-latency pooler, where a transient wedge otherwise self-exits a healthy worker mid-cycle. Reuses parsePositiveInt; explicit opts still win. The upper clamp stops a bad config from masking a dead DB.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
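A minimal sketch of the clamping described above. `parsePositiveIntSketch`, `resolveProbeSettingsSketch`, and the `ProbeOptsSketch` shape are hypothetical names for illustration, not the worker's actual code. It also assumes explicit opts bypass the clamp, which the commit message does not state.

```typescript
// Hypothetical stand-in for the repo's parsePositiveInt: returns undefined
// for anything that is not a strictly positive base-10 integer.
function parsePositiveIntSketch(raw: string | undefined): number | undefined {
  if (raw === undefined) return undefined;
  const trimmed = raw.trim();
  if (!/^\d+$/.test(trimmed)) return undefined;
  const n = Number.parseInt(trimmed, 10);
  return n > 0 ? n : undefined;
}

function clampSketch(n: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, n));
}

// Assumed option shape; the real worker options may differ.
interface ProbeOptsSketch {
  failExitAfter?: number;
  probeTimeoutMs?: number;
}

function resolveProbeSettingsSketch(
  env: Record<string, string | undefined>,
  opts: ProbeOptsSketch = {},
): { failExitAfter: number; probeTimeoutMs: number } {
  const envFail = parsePositiveIntSketch(env["GBRAIN_DB_FAIL_EXIT_AFTER"]);
  const envTimeout = parsePositiveIntSketch(env["GBRAIN_DB_PROBE_TIMEOUT_MS"]);
  return {
    // Explicit opts win; otherwise the clamped env value; otherwise the
    // previously hardcoded defaults (3 probes, 10s).
    failExitAfter:
      opts.failExitAfter ?? (envFail !== undefined ? clampSketch(envFail, 1, 10) : 3),
    probeTimeoutMs:
      opts.probeTimeoutMs ??
      (envTimeout !== undefined ? clampSketch(envTimeout, 10000, 120000) : 10000),
  };
}
```

Under these assumptions, an out-of-range `GBRAIN_DB_FAIL_EXIT_AFTER=50` resolves to 10 rather than being rejected, so a typo cannot disable the self-exit entirely.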
…arse)
`gbrain frontmatter validate <file>` derived the path slug from the ABSOLUTE path in single-file mode (relative(target,file)='' -> abs-path fallback), spuriously failing SLUG_MISMATCH for every explicit-slug file -- the pre-commit hook's exact usage. Now uses `git rev-parse --show-toplevel` (handles worktrees, submodules, symlinks) with the manual walk-up as fallback. Also fixes subdir mode.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
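A rough sketch of the idea behind this fix. The names `findRepoRootSketch` and `slugFromRootSketch` are hypothetical, and the slug rule used here (repo-relative path minus extension) is an assumption for illustration; the real slug derivation may differ. The manual walk-up fallback is omitted.

```typescript
import { execFileSync } from "node:child_process";
import * as path from "node:path";

// Ask git for the repo root; returns undefined outside a repo or when git
// is unavailable, so a caller can fall back to a manual walk-up.
function findRepoRootSketch(startDir: string): string | undefined {
  try {
    return execFileSync("git", ["rev-parse", "--show-toplevel"], {
      cwd: startDir,
      encoding: "utf8",
      stdio: ["ignore", "pipe", "ignore"],
    }).trim();
  } catch {
    return undefined;
  }
}

// Assumed slug rule: path relative to the root, extension stripped,
// forward slashes regardless of platform.
function slugFromRootSketch(root: string, file: string): string {
  const rel = path.relative(root, file);
  return rel.replace(/\.[^./\\]+$/, "").split(path.sep).join("/");
}

// The single-file trap: relative(target, file) is '' when both are the same
// path, which is what triggered the absolute-path fallback.
const sameTargetRelative = path.relative("/repo/notes/a.md", "/repo/notes/a.md");
```

Anchoring on the repo root instead of the validate target keeps the derived slug the same whether the command is given a directory, a subdirectory, or a single file.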
Per-phase timeout (GBRAIN_PHASE_TIMEOUT_SECONDS / CycleOpts.phaseTimeoutSeconds): heavy phases (resolve_symbol_edges, embed) abort at a batch boundary on the deadline, recorded skipped:phase_timeout with partial counts preserved, and the cycle CONTINUES instead of dying at the 1800s job timeout. Watermark-safe resume. resolve_symbol_edges made abort-aware (mirrors embed's garrytan#1737). The relabel only fires for a genuine abort -- a real phase error is never swallowed as a timeout.
Commit phase (cycle.commit, runs last; PHASE_SCOPE 'global' -> commits the repo once per cycle, not per-source): stage + timestamp-commit + push the brain tree, reusing endorse.ts's execFileSync git pattern. All git calls bounded (30s timeout, GIT_TERMINAL_PROMPT=0) so a hung/credential-prompting git can't block the event loop under the cycle lock. Non-git dir skipped, clean tree no-op, hook-reject -> fail, push failure non-fatal + credential-redacted. Closes the gap where autopilot never committed brain pages (sync only reads already-committed content).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
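The batch-boundary behaviour can be sketched as follows. This is a simplified, synchronous illustration with an injectable clock. The actual phases are described as abort-signal driven, and `runBatchesWithDeadlineSketch` and `PhaseResultSketch` are hypothetical names.

```typescript
type PhaseResultSketch =
  | { status: "ok"; processed: number }
  | { status: "skipped"; reason: "phase_timeout"; processed: number };

function runBatchesWithDeadlineSketch<T>(
  batches: T[][],
  processBatch: (batch: T[]) => void,
  timeoutSeconds: number,
  now: () => number = Date.now,
): PhaseResultSketch {
  const deadline = now() + timeoutSeconds * 1000;
  let processed = 0;
  for (const batch of batches) {
    // The deadline is checked only between batches, so a batch is never cut
    // in half and the partial count reflects fully completed work.
    if (now() >= deadline) {
      return { status: "skipped", reason: "phase_timeout", processed };
    }
    // A real error thrown here propagates to the caller; it is never
    // relabelled as a timeout.
    processBatch(batch);
    processed += batch.length;
  }
  return { status: "ok", processed };
}
```

Because the skipped result carries the partial count, the caller can record it and move on to the next phase instead of failing the whole cycle.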
…imeout on cycles
The autopilot main loop awaited engine.getConfig('version') unbounded; a half-open
pooler connection that never settles parked the loop forever (the alive-but-idle
wedge: process up, logs frozen, 0 cycles enqueued) -- uncaught by the worker
DB-probe and the MinionSupervisor watchdog (autopilot uses ChildWorkerSupervisor).
Bound the probe (GBRAIN_AUTOPILOT_PROBE_TIMEOUT_MS, default 10s) with the worker's
probeWithTimeout idiom so a hang throws into the existing reconnect/exit machinery
(launchd relaunch). The autopilot-cycle handler now passes phaseTimeoutSeconds
(default 600s, env-overridable) so the per-phase timeout is active on the autopilot
path; other runCycle callers (dream, manual) leave it off.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
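A minimal sketch of bounding a probe so that a never-settling promise turns into a thrown error. `probeWithTimeoutSketch` is a hypothetical helper written for illustration; the worker's actual probeWithTimeout may be implemented differently.

```typescript
function probeWithTimeoutSketch<T>(
  probe: () => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(
      () => reject(new Error(`probe timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
  // Whichever settles first wins. The timer is always cleared so a fast
  // probe does not leave a pending timeout behind.
  return Promise.race([probe(), timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
```

In the main loop, the unbounded `await engine.getConfig('version')` would then be wrapped in such a helper, so a half-open connection rejects after the timeout and reaches the existing reconnect/exit path. Note that the race does not cancel the underlying query; it only stops the loop from waiting on it.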
…efault)
New facts hardcoded visibility='private' (backstop.ts), so autopilot/put_page/sync-extracted facts never surfaced in remote 'recall' (world-only). Add a facts.default_visibility config knob (default 'private', safe) honored at the backstop write chokepoint; an explicit ctx.visibility (the extract_facts MCP op) still wins. `gbrain config set facts.default_visibility world` then makes new facts default 'world' for hot memory; the fence-write serializes it so the cycle extract_facts reconcile preserves it.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
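The precedence described above, sketched under the assumption that 'private' and 'world' are the only visibility values (only those two appear in this commit message). `resolveFactVisibilitySketch` is a hypothetical name, not the backstop.ts function.

```typescript
type VisibilitySketch = "private" | "world";

function resolveFactVisibilitySketch(
  ctxVisibility: VisibilitySketch | undefined,
  configDefault: string | undefined,
): VisibilitySketch {
  // 1. An explicit per-call visibility always wins.
  if (ctxVisibility !== undefined) return ctxVisibility;
  // 2. Otherwise honor the config knob, but only for the known 'world' value.
  // 3. Anything else (unset, typo) falls back to the safe default 'private'.
  return configDefault === "world" ? "world" : "private";
}
```

Treating an unrecognized config value as 'private' keeps a misconfiguration from silently widening visibility.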
…N.stringify+$3::jsonb
recordCompleted bound `JSON.stringify(sorted)` to a positional `$3::jsonb` cast. postgres.js DOUBLE-ENCODES that (the JSON string is re-stringified into a jsonb STRING scalar, not an array); this is the CLAUDE.md / postgres-engine.ts double-encode trap. v0.42.51.0 (garrytan#2255) added the `op_checkpoints_completed_keys_array` CHECK (jsonb_typeof = 'array') without fixing this writer, so every non-empty recordCompleted write (e.g. sync's sync-target pin) now violates the constraint and aborts the sync ("checkpoint target write failed"). Bind the keys as a text[] and convert with to_jsonb($3::text[]), which mirrors appendCompleted's unnest($3::text[]) idiom in the same file; verified to yield a true jsonb array.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
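The effect of the double encoding can be reproduced without a database. The sketch below only simulates it: `jsonbTypeofSketch` is a hypothetical approximation of Postgres's `jsonb_typeof`, and the key strings are made up for illustration.

```typescript
// Approximates jsonb_typeof for a JSON text: reports the top-level type.
function jsonbTypeofSketch(jsonText: string): string {
  const value: unknown = JSON.parse(jsonText);
  if (value === null) return "null";
  if (Array.isArray(value)) return "array";
  return typeof value; // "object", "string", "number" or "boolean"
}

// Illustrative keys only; real checkpoint keys will look different.
const sortedKeysSketch = ["sync-target:a", "sync-target:b"];

// What the writer intended to store: a JSON array.
const encodedOnce = JSON.stringify(sortedKeysSketch);

// What a second stringify pass produces: a JSON string whose content
// happens to look like an array.
const encodedTwice = JSON.stringify(encodedOnce);

console.log(jsonbTypeofSketch(encodedOnce)); // array
console.log(jsonbTypeofSketch(encodedTwice)); // string
```

The second value is what fails a `jsonb_typeof(...) = 'array'` CHECK. Binding a plain string array and letting the server build the jsonb with `to_jsonb($3::text[])` removes the client-side JSON encoding step altogether.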
…ests
The per-phase-timeout work (runPhaseWithTimeout) registers an 'abort'
listener on the parent signal via addEventListener. The handler tests were
passing a fake `{ aborted: false } as any` mock that has no addEventListener,
so those 4 tests threw TypeError once the timeout path engaged. The real
worker passes a genuine AbortSignal (production cycles run fine). Swap the
mocks for `new AbortController().signal`. No production code change.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
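A small sketch of why the old mock broke. `onParentAbortSketch` is a hypothetical stand-in for the listener registration inside runPhaseWithTimeout, not its actual code.

```typescript
// Registers an abort listener on the parent signal and returns a cleanup
// function that removes it again.
function onParentAbortSketch(parent: AbortSignal, onAbort: () => void): () => void {
  if (parent.aborted) {
    onAbort();
    return () => {};
  }
  parent.addEventListener("abort", onAbort, { once: true });
  return () => parent.removeEventListener("abort", onAbort);
}

// A genuine signal, as the fixed tests now pass in.
const realController = new AbortController();
let abortCount = 0;
onParentAbortSketch(realController.signal, () => {
  abortCount += 1;
});
realController.abort(); // the listener runs during this call

// The old-style mock has no addEventListener, so registration throws.
let mockThrewTypeError = false;
try {
  onParentAbortSketch({ aborted: false } as unknown as AbortSignal, () => {});
} catch (err) {
  mockThrewTypeError = err instanceof TypeError;
}
```

A real `AbortController().signal` costs nothing to create in a test and exercises the same listener path the worker uses.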
Autopilot resilience + hot-memory hardening (6 commits + a test fix). All off origin/master.

Commits
- … `git rev-parse`) so it's stable regardless of CWD.
- … `status:'skipped', reason:'phase_timeout'`), and the cycle continues instead of dying at the job wall-clock. Partial work is durable (watermarks) and resumes next cycle.
- … `facts.default_visibility` knob (configurable hot-memory default).
- … `completed_keys` via `to_jsonb(text[])` instead of `JSON.stringify` + `$3::jsonb` (the latter double-encodes → jsonb string scalar → violates the `op_checkpoints_completed_keys_array` CHECK). (v0.42.51.0 fix(sync): contention-free clock + checkpoint integrity + honest sync freshness #2255 follow-up.)
- … `AbortSignal` to the autopilot-cycle handler tests. `runPhaseWithTimeout` registers an abort listener via `parentSignal.addEventListener`; the tests passed a fake `{ aborted: false } as any` mock with no `addEventListener`, so they threw once the timeout path engaged. The real worker passes a genuine `AbortSignal` (production cycles run fine). No production code change.

Tests
Full suite green in a clean environment. (Two tests are environment-sensitive only when run against a live machine: `resolveBoostMap` reads `GBRAIN_SOURCE_BOOST` from the ambient shell, and `worker-registry.serial` contends with a running autopilot writing under `~/.gbrain`; both pass in isolated CI.)

🤖 Generated with Claude Code