feat(sdlc): FM-11 lane supervisor — dead lanes always auto-restart#3803
Conversation
Build the FM-11 lane supervisor (CASE-CROSS-RUNTIME-COMMS-001, reform
Phase 6): a dead lane is ALWAYS respawned regardless of task presence,
across the claude/codex/antigrav runtimes (operator standing mandate —
idle-await is fine, dead is not).
- scripts/hapax-lane-supervisor: dedicated one_for_one liveness supervisor.
Liveness = claude pidfile-alive OR tmux session; codex/antigrav tmux
session. Respawn (always, when the worktree exists + past cooldown +
under a StartLimit-style burst cap): claude+task -> headless resume;
claude+no-task -> read-only idle-await (the headless launcher is
default-deny when task-less); codex/antigrav -> tmux launcher --no-claim.
Clean split: the supervisor guarantees the process exists; the
launcher/dispatcher decides what it does (so respawning a quota-walled
or task-less lane into idle-await is correct, not spam).
- systemd/units/hapax-lane-supervisor.{service,timer}: a wired 60s oneshot.
The old dead-lane block lived in hapax-lane-rate-limit-watchdog, which
was never attached to any timer.
- systemd/units/hapax-claude-lane@.service: Restart=no -> always +
RestartSec + StartLimitIntervalSec/StartLimitBurst (FM-11 FORMALIZE).
- scripts/hapax-lane-rate-limit-watchdog: drop the unwired, task-gated,
greek-only dead-lane block (superseded by the supervisor) and add a
HAPAX_HEADLESS_PIPE_DIR override so tests do not read the host's real
/run/user lane runtime.
- tests: add test_lane_supervisor.py (12) + test_lane_supervisor_units.py
(4); replace the locked-in non-restart assertion in
test_lane_watchdog_methodology_dispatch.py with the clean-split
delegation assertion.
Acceptance criteria met: liveness guaranteed regardless of task presence,
cx-*/antigrav covered (not greek-only), dead-no-task lane respawned.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis PR introduces a dedicated lane supervisor system that periodically ensures FM-11 lanes remain alive. It refactors the rate-limit watchdog to remove dead-lane restart logic, delegates that responsibility to a new periodic supervisor script driven by systemd, and updates the lane service template to enable automatic restarts with rate limiting. ChangesFM-11 Lane Supervisor System
Sequence DiagramsequenceDiagram
participant Timer as systemd Timer
participant Supervisor as hapax-lane-supervisor
participant Alive as Lane Liveness Check
participant Task as Task Vault Lookup
participant Respawn as Lane Respawn
participant Lane as Claude/Codex/Antigrav Lane
Timer->>Supervisor: Trigger every 60s
Supervisor->>Alive: Check if lane alive (PIDfile/tmux)
alt Lane is dead
Supervisor->>Task: Resolve claimed task (if Claude)
Task-->>Supervisor: task_id, title or empty
Supervisor->>Respawn: Execute kind-specific respawn
alt Claude with claimed task
Respawn->>Lane: Launch headless (setsid/disown)
else Claude without task
Respawn->>Lane: Launch read-only idle-await
else Codex or Antigrav
Respawn->>Lane: Launch tmux idle-await session
end
Supervisor->>Supervisor: Record restart timestamp (spam control)
else Lane is alive
Supervisor-->>Supervisor: Skip this lane
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 96085b0385
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then | ||
| return 0 |
There was a problem hiding this comment.
Track the headless wrapper instead of its child pid
When a Claude lane is already running under hapax-claude-headless, this pidfile only contains the current inner claude child PID; the wrapper writes that PID before wait and then sleeps before restarting it (scripts/hapax-claude-headless lines 165-188). If the child exits or hits a quota/error and the 60s supervisor timer runs during that backoff window, kill -0 fails even though the wrapper is alive, so this new supervisor starts a second headless process on the same lane/task/worktree. That can create concurrent mutating agents for one claim; the liveness check needs to distinguish the supervisor wrapper (or another durable marker like a live FIFO owned by the wrapper) from the transient child PID.
Useful? React with 👍 / 👎.
| # the operator-attended integrator, never auto-respawned). cx-*/antigrav default | ||
| # to the operator-designated always-on set; all overridable for ops + tests. | ||
| CLAUDE_LANES="${HAPAX_SUPERVISOR_CLAUDE_LANES:-beta gamma delta epsilon zeta eta theta}" | ||
| CODEX_LANES="${HAPAX_SUPERVISOR_CODEX_LANES:-cx-cpu-bridge cx-effect-drift-current cx-preset-storm cx-visual-bandwidth}" |
There was a problem hiding this comment.
Include required Codex lanes in the default sweep
With the new timer service using this default environment, HAPAX_SUPERVISOR_CODEX_LANES is not set anywhere by the unit, so only these four task-named cx-* sessions are supervised. The repo instructions identify cx-red and protected cx-violet as lanes that require screen/session continuity (AGENTS.md lines 10-13), but a dead hapax-codex-cx-red or hapax-codex-cx-violet will never be checked unless an operator separately overrides the env, which defeats the advertised always-on cx coverage.
Useful? React with 👍 / 👎.
| EnvironmentFile=-%h/.cache/hapax/claude-headless/%i.env | ||
| ExecStart=/usr/bin/env bash -lc 'exec "$HOME/.local/bin/hapax-methodology-dispatch" --lane "%i" --platform claude --mode headless --task "${HAPAX_DISPATCH_TASK:?HAPAX_DISPATCH_TASK required in per-lane env}" --launch' | ||
| Restart=no | ||
| Restart=always |
There was a problem hiding this comment.
Avoid restarting a dispatch path without MQ binding
For the systemd-managed Claude lanes, changing this to Restart=always repeatedly invokes the same ExecStart, but that command only passes --task from HAPAX_DISPATCH_TASK and never provides the durable MQ message id required by scripts/hapax-methodology-dispatch for mutable --launch calls (it blocks with strict_mq_message_id_required at lines 1284-1299 unless HAPAX_METHODOLOGY_DISPATCH_MESSAGE_ID is present). In the normal per-lane env this turns a failed launch into a StartLimit crash loop rather than restoring the lane, so the restart path needs to carry the dispatch message id or avoid the strict launch route.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
tests/scripts/test_lane_supervisor.py (2)
251-259: ⚡ Quick winConsider adding a live-skip test for antigrav lanes.
The test suite includes
test_supervisor_skips_live_codex_lane(line 237) but lacks a paralleltest_supervisor_skips_live_antigrav_lane. Since antigrav uses the same tmux-based liveness detection as codex, adding this test would improve coverage symmetry.📋 Proposed additional test
Add after line 259:
def test_supervisor_skips_live_antigrav_lane(tmp_path: Path) -> None: env, calls = _base( tmp_path, HAPAX_SUPERVISOR_ANTIGRAV_LANES="antigrav", TMUX_LIVE="hapax-antigrav-antigrav", ) _make_worktree(env, "antigrav") result = _run(env) assert result.returncode == 0, result.stderr assert _reads(calls, "antigrav.txt") == ""🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/scripts/test_lane_supervisor.py` around lines 251 - 259, Add a new test mirroring test_supervisor_skips_live_codex_lane named test_supervisor_skips_live_antigrav_lane: call _base(...) with HAPAX_SUPERVISOR_ANTIGRAV_LANES="antigrav" and TMUX_LIVE="hapax-antigrav-antigrav", create the worktree with _make_worktree(env, "antigrav"), run the supervisor with _run(env), then assert result.returncode == 0 and assert _reads(calls, "antigrav.txt") == "" to verify the supervisor skips a live antigrav lane.
275-288: 💤 Low valueConsider asserting on log messages in guardrail tests.
Both
test_supervisor_respects_restart_cooldownandtest_supervisor_burst_limit_backs_offverify the absence of extra launches but do not assert that the supervisor logged why it skipped (cooldown window vs. burst limit hit). Adding assertions onresult.stdoutwould improve observability and strengthen confidence that the right guardrail triggered.📋 Example enhancements
In
test_supervisor_respects_restart_cooldown:# Second pass within cooldown must NOT respawn again. - _run(env) + result = _run(env) assert _reads(calls, "claude.txt") == first + assert "cooldown" in result.stdout.lower()In
test_supervisor_burst_limit_backs_off:for _ in range(4): - _run(env) + result = _run(env) launches = [ln for ln in _reads(calls, "claude.txt").splitlines() if ln.strip()] assert len(launches) == 2 # capped at burst limit + assert "burst" in result.stdout.lower() or "limit" in result.stdout.lower()Also applies to: 306-321
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/scripts/test_lane_supervisor.py` around lines 275 - 288, The tests currently check that no new launch occurred but don't assert the supervisor logged why; update test_supervisor_respects_restart_cooldown and test_supervisor_burst_limit_backs_off to capture the result from the second _run() call (e.g., result = _run(env)) and add an assertion on result.stdout that the supervisor emitted a clear explanatory message—for the cooldown test assert the stdout contains keywords like "skipping" and "cooldown" (referencing test_supervisor_respects_restart_cooldown, _run, and _reads/claude.txt), and for the burst-limit test assert stdout contains keywords like "skipping" and "burst limit" (referencing test_supervisor_burst_limit_backs_off and _run) so the guardrail reason is verifiably logged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/hapax-lane-supervisor`:
- Around line 82-95: The restart-log is allowed to grow unbounded because the
checker (the while read -r ts loop that reads "$logf" and compares timestamps to
cutoff using BURST_WINDOW_S and increments count towards BURST_LIMIT) scans the
whole file each tick while record_restart() only appends; fix by
truncating/rotating the log when recording a restart: update record_restart() to
append the timestamp then trim the file to a bounded size (either remove entries
older than now - BURST_WINDOW_S or keep only the last N lines) so the subsequent
loop only ever reads a small, recent set of timestamps; ensure you still write
"$STATE_DIR/$1.last-restart" and use the same variables now, BURST_WINDOW_S, and
BURST_LIMIT.
In `@tests/scripts/test_lane_supervisor.py`:
- Around line 31-39: The bash snippet in _write_recorder interpolates the Path
log directly into the script which can allow shell injection or break on spaces;
escape or shell-quote the path before embedding it (e.g., use
shlex.quote(str(log))) when calling _write_executable so the redirect target is
safe, and keep the reference to the log target inside the generated script (the
change is local to _write_recorder and affects how the "{log}" token is produced
for _write_executable).
---
Nitpick comments:
In `@tests/scripts/test_lane_supervisor.py`:
- Around line 251-259: Add a new test mirroring
test_supervisor_skips_live_codex_lane named
test_supervisor_skips_live_antigrav_lane: call _base(...) with
HAPAX_SUPERVISOR_ANTIGRAV_LANES="antigrav" and
TMUX_LIVE="hapax-antigrav-antigrav", create the worktree with
_make_worktree(env, "antigrav"), run the supervisor with _run(env), then assert
result.returncode == 0 and assert _reads(calls, "antigrav.txt") == "" to verify
the supervisor skips a live antigrav lane.
- Around line 275-288: The tests currently check that no new launch occurred but
don't assert the supervisor logged why; update
test_supervisor_respects_restart_cooldown and
test_supervisor_burst_limit_backs_off to capture the result from the second
_run() call (e.g., result = _run(env)) and add an assertion on result.stdout
that the supervisor emitted a clear explanatory message—for the cooldown test
assert the stdout contains keywords like "skipping" and "cooldown" (referencing
test_supervisor_respects_restart_cooldown, _run, and _reads/claude.txt), and for
the burst-limit test assert stdout contains keywords like "skipping" and "burst
limit" (referencing test_supervisor_burst_limit_backs_off and _run) so the
guardrail reason is verifiably logged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: e13a022d-e1e7-43d5-9592-555cb4956f9b
📒 Files selected for processing (8)
scripts/hapax-lane-rate-limit-watchdogscripts/hapax-lane-supervisorsystemd/units/hapax-claude-lane@.servicesystemd/units/hapax-lane-supervisor.servicesystemd/units/hapax-lane-supervisor.timertests/scripts/test_lane_supervisor.pytests/scripts/test_lane_watchdog_methodology_dispatch.pytests/systemd/test_lane_supervisor_units.py
| local logf="$STATE_DIR/$1.restart-log" cutoff count=0 ts | ||
| [ -f "$logf" ] || return 1 | ||
| cutoff=$(( now - BURST_WINDOW_S )) | ||
| while read -r ts; do | ||
| [ -z "$ts" ] && continue | ||
| [ "$ts" -ge "$cutoff" ] && count=$(( count + 1 )) | ||
| done < "$logf" | ||
| [ "$count" -ge "$BURST_LIMIT" ] | ||
| } | ||
|
|
||
| record_restart() { | ||
| echo "$now" > "$STATE_DIR/$1.last-restart" | ||
| echo "$now" >> "$STATE_DIR/$1.restart-log" | ||
| } |
There was a problem hiding this comment.
Bound restart-log growth to keep supervisor runtime stable.
Line 82–89 scans full *.restart-log files each tick, while Line 94 only appends. For repeatedly dead lanes, this becomes unbounded and can eventually threaten the oneshot timeout budget.
💡 Suggested patch
burst_exhausted() {
- local logf="$STATE_DIR/$1.restart-log" cutoff count=0 ts
+ local logf="$STATE_DIR/$1.restart-log" cutoff count
[ -f "$logf" ] || return 1
cutoff=$(( now - BURST_WINDOW_S ))
- while read -r ts; do
- [ -z "$ts" ] && continue
- [ "$ts" -ge "$cutoff" ] && count=$(( count + 1 ))
- done < "$logf"
+ count="$(awk -v cutoff="$cutoff" '$1 >= cutoff {c++} END {print c+0}' "$logf" 2>/dev/null || echo 0)"
[ "$count" -ge "$BURST_LIMIT" ]
}
record_restart() {
- echo "$now" > "$STATE_DIR/$1.last-restart"
- echo "$now" >> "$STATE_DIR/$1.restart-log"
+ local lane="$1" logf="$STATE_DIR/$1.restart-log" cutoff
+ echo "$now" > "$STATE_DIR/$lane.last-restart"
+ cutoff=$(( now - BURST_WINDOW_S ))
+ if [ -f "$logf" ]; then
+ awk -v cutoff="$cutoff" '$1 >= cutoff' "$logf" > "$logf.tmp" 2>/dev/null || true
+ mv "$logf.tmp" "$logf"
+ fi
+ echo "$now" >> "$logf"
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| local logf="$STATE_DIR/$1.restart-log" cutoff count=0 ts | |
| [ -f "$logf" ] || return 1 | |
| cutoff=$(( now - BURST_WINDOW_S )) | |
| while read -r ts; do | |
| [ -z "$ts" ] && continue | |
| [ "$ts" -ge "$cutoff" ] && count=$(( count + 1 )) | |
| done < "$logf" | |
| [ "$count" -ge "$BURST_LIMIT" ] | |
| } | |
| record_restart() { | |
| echo "$now" > "$STATE_DIR/$1.last-restart" | |
| echo "$now" >> "$STATE_DIR/$1.restart-log" | |
| } | |
| local logf="$STATE_DIR/$1.restart-log" cutoff count | |
| [ -f "$logf" ] || return 1 | |
| cutoff=$(( now - BURST_WINDOW_S )) | |
| count="$(awk -v cutoff="$cutoff" '$1 >= cutoff {c++} END {print c+0}' "$logf" 2>/dev/null || echo 0)" | |
| [ "$count" -ge "$BURST_LIMIT" ] | |
| } | |
| record_restart() { | |
| local lane="$1" logf="$STATE_DIR/$1.restart-log" cutoff | |
| echo "$now" > "$STATE_DIR/$lane.last-restart" | |
| cutoff=$(( now - BURST_WINDOW_S )) | |
| if [ -f "$logf" ]; then | |
| awk -v cutoff="$cutoff" '$1 >= cutoff' "$logf" > "$logf.tmp" 2>/dev/null || true | |
| mv "$logf.tmp" "$logf" | |
| fi | |
| echo "$now" >> "$logf" | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/hapax-lane-supervisor` around lines 82 - 95, The restart-log is
allowed to grow unbounded because the checker (the while read -r ts loop that
reads "$logf" and compares timestamps to cutoff using BURST_WINDOW_S and
increments count towards BURST_LIMIT) scans the whole file each tick while
record_restart() only appends; fix by truncating/rotating the log when recording
a restart: update record_restart() to append the timestamp then trim the file to
a bounded size (either remove entries older than now - BURST_WINDOW_S or keep
only the last N lines) so the subsequent loop only ever reads a small, recent
set of timestamps; ensure you still write "$STATE_DIR/$1.last-restart" and use
the same variables now, BURST_WINDOW_S, and BURST_LIMIT.
| def _write_recorder(path: Path, log: Path) -> None: | ||
| """A fake launcher that records its argv to ``log`` and exits 0.""" | ||
| _write_executable( | ||
| path, | ||
| f""" | ||
| #!/usr/bin/env bash | ||
| printf '%s\\n' "$*" >> "{log}" | ||
| """, | ||
| ) |
There was a problem hiding this comment.
Quote the log path to prevent shell injection.
The log path is directly interpolated into the bash script without proper quoting. If log contains spaces or special characters, the script will break.
🛡️ Proposed fix with proper path quoting
def _write_recorder(path: Path, log: Path) -> None:
"""A fake launcher that records its argv to ``log`` and exits 0."""
+ # Escape the log path for safe shell interpolation
+ log_escaped = str(log).replace("'", "'\"'\"'")
_write_executable(
path,
f"""
#!/usr/bin/env bash
- printf '%s\\n' "$*" >> "{log}"
+ printf '%s\\n' "$*" >> '{log_escaped}'
""",
)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/scripts/test_lane_supervisor.py` around lines 31 - 39, The bash snippet
in _write_recorder interpolates the Path log directly into the script which can
allow shell injection or break on spaces; escape or shell-quote the path before
embedding it (e.g., use shlex.quote(str(log))) when calling _write_executable so
the redirect target is safe, and keep the reference to the log target inside the
generated script (the change is local to _write_recorder and affects how the
"{log}" token is produced for _write_executable).
FM-11 lane supervisor — dead lanes always auto-restart
Task:
reform-fix-lane-supervisor-20260531AuthorityCase: CASE-CROSS-RUNTIME-COMMS-001
Parent spec: coordination-reform-master-design-2026-05-30 (§FM-11, Phase 6)
Problem
There was no production lane supervisor.
hapax-claude-lane@.servicehadRestart=no. The only dead-lane respawn logic lived inhapax-lane-rate-limit-watchdog, which was (a) never attached to any timer (dead code), (b) gated restart on task-presence — a dead lane with no active task hit"DEAD with no active task — not restarting"and was left dead (violating the operator mandate that dead lanes must always auto-restart), and (c) covered onlybeta gamma delta epsilon zeta(no cx-*, no antigrav). A test even locked in the non-restart behaviour.Change
A dedicated
hapax-lane-supervisorthat decouples process-liveness from task-presence. Clean split: the supervisor guarantees the process exists; the launcher/dispatcher decides what it does — so respawning a quota-walled or task-less lane into idle-await is correct, not spam.scripts/hapax-lane-supervisor— one_for_one liveness across all runtimes:hapax-claude-headlessresume; claude+no-task →hapax-claude --readonlyidle-await (the headless launcher is default-deny when task-less); codex/antigrav → tmux launcher--no-claim.systemd/units/hapax-lane-supervisor.{service,timer}— a wired 60s oneshot (auto-enabled by theinstall-units.shtimer sweep).systemd/units/hapax-claude-lane@.service—Restart=no→always+RestartSec+StartLimitIntervalSec/StartLimitBurst(FM-11 FORMALIZE; per-lane task binding preserved).scripts/hapax-lane-rate-limit-watchdog— drop the unwired/task-gated/greek-only dead-lane block (superseded); addHAPAX_HEADLESS_PIPE_DIRoverride so tests don't read the host's real/run/userlane runtime.Acceptance criteria
Restart=always+StartLimit and a one_for_one supervisor).Tests
tests/scripts/test_lane_supervisor.py(12): respawn dead-no-task (idle-await) / dead-with-task (resume) claude; cx-*/antigrav respawn; live-lane skip (pidfile + tmux); cooldown; burst/StartLimit back-off; worktree-absent skip; dry-run; shell syntax.tests/systemd/test_lane_supervisor_units.py(4): supervisor oneshot+60s timer;Restart=always+StartLimit; per-lane task env preserved.tests/scripts/test_lane_watchdog_methodology_dispatch.py: replaced the locked-in non-restart assertion with the clean-split delegation assertion.🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
Improvements