Skip to content

feat(sdlc): FM-11 lane supervisor — dead lanes always auto-restart#3803

Merged
ryanklee merged 1 commit into
mainfrom
delta/reform-fix-lane-supervisor-20260531
Jun 1, 2026
Merged

feat(sdlc): FM-11 lane supervisor — dead lanes always auto-restart#3803
ryanklee merged 1 commit into
mainfrom
delta/reform-fix-lane-supervisor-20260531

Conversation

@ryanklee
Copy link
Copy Markdown
Collaborator

@ryanklee ryanklee commented May 31, 2026

FM-11 lane supervisor — dead lanes always auto-restart

Task: reform-fix-lane-supervisor-20260531
AuthorityCase: CASE-CROSS-RUNTIME-COMMS-001
Parent spec: coordination-reform-master-design-2026-05-30 (§FM-11, Phase 6)

Problem

There was no production lane supervisor. hapax-claude-lane@.service had Restart=no. The only dead-lane respawn logic lived in hapax-lane-rate-limit-watchdog, which was (a) never attached to any timer (dead code), (b) gated restart on task-presence — a dead lane with no active task hit "DEAD with no active task — not restarting" and was left dead (violating the operator mandate that dead lanes must always auto-restart), and (c) covered only beta gamma delta epsilon zeta (no cx-*, no antigrav). A test even locked in the non-restart behaviour.

Change

A dedicated hapax-lane-supervisor that decouples process-liveness from task-presence. Clean split: the supervisor guarantees the process exists; the launcher/dispatcher decides what it does — so respawning a quota-walled or task-less lane into idle-await is correct, not spam.

  • scripts/hapax-lane-supervisor — one_for_one liveness across all runtimes:
    • claude (greek): pidfile-alive or tmux session; codex (cx-*) / antigrav: tmux session.
    • Respawn (always, when worktree exists + past cooldown + under a StartLimit-style burst cap): claude+task → hapax-claude-headless resume; claude+no-task → hapax-claude --readonly idle-await (the headless launcher is default-deny when task-less); codex/antigrav → tmux launcher --no-claim.
  • systemd/units/hapax-lane-supervisor.{service,timer} — a wired 60s oneshot (auto-enabled by the install-units.sh timer sweep).
  • systemd/units/hapax-claude-lane@.serviceRestart=noalways + RestartSec + StartLimitIntervalSec/StartLimitBurst (FM-11 FORMALIZE; per-lane task binding preserved).
  • scripts/hapax-lane-rate-limit-watchdog — drop the unwired/task-gated/greek-only dead-lane block (superseded); add HAPAX_HEADLESS_PIPE_DIR override so tests don't read the host's real /run/user lane runtime.

Acceptance criteria

  • A supervisor guarantees lane-process liveness regardless of task presence (Restart=always+StartLimit and a one_for_one supervisor).
  • cx-* and antigrav lanes are covered (not greek-only).
  • A dead-no-task lane is respawned (test updated to assert this, replacing the locked-in non-restart assertion).
  • Ruff + tests pass.

Tests

  • tests/scripts/test_lane_supervisor.py (12): respawn dead-no-task (idle-await) / dead-with-task (resume) claude; cx-*/antigrav respawn; live-lane skip (pidfile + tmux); cooldown; burst/StartLimit back-off; worktree-absent skip; dry-run; shell syntax.
  • tests/systemd/test_lane_supervisor_units.py (4): supervisor oneshot+60s timer; Restart=always+StartLimit; per-lane task env preserved.
  • tests/scripts/test_lane_watchdog_methodology_dispatch.py: replaced the locked-in non-restart assertion with the clean-split delegation assertion.

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Automated periodic lane health checks running every 60 seconds
    • Automatic lane recovery upon failure detection
  • Improvements

    • Added restart rate limiting to prevent cascading failures (maximum 5 restarts per 5-minute window)
    • Configured automatic service restart with 10-second intervals between attempts

Build the FM-11 lane supervisor (CASE-CROSS-RUNTIME-COMMS-001, reform
Phase 6): a dead lane is ALWAYS respawned regardless of task presence,
across the claude/codex/antigrav runtimes (operator standing mandate —
idle-await is fine, dead is not).

- scripts/hapax-lane-supervisor: dedicated one_for_one liveness supervisor.
  Liveness = claude pidfile-alive OR tmux session; codex/antigrav tmux
  session. Respawn (always, when the worktree exists + past cooldown +
  under a StartLimit-style burst cap): claude+task -> headless resume;
  claude+no-task -> read-only idle-await (the headless launcher is
  default-deny when task-less); codex/antigrav -> tmux launcher --no-claim.
  Clean split: the supervisor guarantees the process exists; the
  launcher/dispatcher decides what it does (so respawning a quota-walled
  or task-less lane into idle-await is correct, not spam).
- systemd/units/hapax-lane-supervisor.{service,timer}: a wired 60s oneshot.
  The old dead-lane block lived in hapax-lane-rate-limit-watchdog, which
  was never attached to any timer.
- systemd/units/hapax-claude-lane@.service: Restart=no -> always +
  RestartSec + StartLimitIntervalSec/StartLimitBurst (FM-11 FORMALIZE).
- scripts/hapax-lane-rate-limit-watchdog: drop the unwired, task-gated,
  greek-only dead-lane block (superseded by the supervisor) and add a
  HAPAX_HEADLESS_PIPE_DIR override so tests do not read the host's real
  /run/user lane runtime.
- tests: add test_lane_supervisor.py (12) + test_lane_supervisor_units.py
  (4); replace the locked-in non-restart assertion in
  test_lane_watchdog_methodology_dispatch.py with the clean-split
  delegation assertion.

Acceptance criteria met: liveness guaranteed regardless of task presence,
cx-*/antigrav covered (not greek-only), dead-no-task lane respawned.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 31, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR introduces a dedicated lane supervisor system that periodically ensures FM-11 lanes remain alive. It refactors the rate-limit watchdog to remove dead-lane restart logic, delegates that responsibility to a new periodic supervisor script driven by systemd, and updates the lane service template to enable automatic restarts with rate limiting.

Changes

FM-11 Lane Supervisor System

Layer / File(s) Summary
Rate-limit watchdog refactoring
scripts/hapax-lane-rate-limit-watchdog, tests/scripts/test_lane_watchdog_methodology_dispatch.py
Removes dead-lane auto-restart logic and adds HEADLESS_PIPE_DIR configuration for test sandboxing. Watchdog now delegates dead-lane restart responsibility to the supervisor and focuses solely on quota-aware re-kicking of live idle lanes.
Lane supervisor core
scripts/hapax-lane-supervisor
New script implementing liveness checks (Claude via PIDfile/tmux, Codex/Antigrav via tmux), per-lane restart spam control (cooldown + burst-limit), claimed-task resolution from vault, and kind-specific respawn commands (Claude headless/idle-await depending on task, Codex/Antigrav tmux idle-await).
Lane service unit restart policy
systemd/units/hapax-claude-lane@.service
Enables Restart=always (replacing no), adds RestartSec=10 inter-restart delay, and configures StartLimitIntervalSec=300 and StartLimitBurst=5 to bound restart frequency.
Supervisor systemd units
systemd/units/hapax-lane-supervisor.service, systemd/units/hapax-lane-supervisor.timer
New service unit (oneshot, 120s timeout, journald logging) and timer unit (60-second intervals after boot) that run the supervisor to guarantee lane liveness.
Supervisor behavior tests
tests/scripts/test_lane_supervisor.py
Comprehensive pytest suite: dead-lane respawn paths, liveness skipping, guardrails (cooldown, burst-limit, dry-run, missing worktree), test harness, and shell-syntax validation.
Systemd unit validation
tests/systemd/test_lane_supervisor_units.py
Integration tests confirming supervisor service/timer configuration, Claude lane template restart policy and rate limits, and per-lane task environment wiring.

Sequence Diagram

sequenceDiagram
  participant Timer as systemd Timer
  participant Supervisor as hapax-lane-supervisor
  participant Alive as Lane Liveness Check
  participant Task as Task Vault Lookup
  participant Respawn as Lane Respawn
  participant Lane as Claude/Codex/Antigrav Lane

  Timer->>Supervisor: Trigger every 60s
  Supervisor->>Alive: Check if lane alive (PIDfile/tmux)
  alt Lane is dead
    Supervisor->>Task: Resolve claimed task (if Claude)
    Task-->>Supervisor: task_id, title or empty
    Supervisor->>Respawn: Execute kind-specific respawn
    alt Claude with claimed task
      Respawn->>Lane: Launch headless (setsid/disown)
    else Claude without task
      Respawn->>Lane: Launch read-only idle-await
    else Codex or Antigrav
      Respawn->>Lane: Launch tmux idle-await session
    end
    Supervisor->>Supervisor: Record restart timestamp (spam control)
  else Lane is alive
    Supervisor-->>Supervisor: Skip this lane
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • hapax-systems/hapax-council#3047: Lane watchdog's quota-blocked receipt filtering (both PRs modify scripts/hapax-lane-rate-limit-watchdog around receipt handling).
  • hapax-systems/hapax-council#3387: Dead-lane restart behavior changes in the watchdog (both PRs modify the same script with conflicting dead-lane restart logic).

Poem

🐰 A supervisor hops through the lanes each minute,
Checking if Claude and Codex are still in it.
Dead lanes respawn with careful control,
No spam, no flood—just keeping them whole.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.93% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title directly and clearly describes the main change: introducing a lane supervisor that ensures dead lanes always auto-restart, matching the primary objective of the PR.
Description check ✅ Passed The description includes all required template sections: a clear summary of the problem and solution, AuthorityCase/Slice metadata, comprehensive test plan with specific test counts, and CLAUDE.md hygiene checklist.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch delta/reform-fix-lane-supervisor-20260531

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 96085b0385

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +62 to +63
if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
return 0
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Track the headless wrapper instead of its child pid

When a Claude lane is already running under hapax-claude-headless, this pidfile only contains the current inner claude child PID; the wrapper writes that PID before wait and then sleeps before restarting it (scripts/hapax-claude-headless lines 165-188). If the child exits or hits a quota/error and the 60s supervisor timer runs during that backoff window, kill -0 fails even though the wrapper is alive, so this new supervisor starts a second headless process on the same lane/task/worktree. That can create concurrent mutating agents for one claim; the liveness check needs to distinguish the supervisor wrapper (or another durable marker like a live FIFO owned by the wrapper) from the transient child PID.

Useful? React with 👍 / 👎.

# the operator-attended integrator, never auto-respawned). cx-*/antigrav default
# to the operator-designated always-on set; all overridable for ops + tests.
CLAUDE_LANES="${HAPAX_SUPERVISOR_CLAUDE_LANES:-beta gamma delta epsilon zeta eta theta}"
CODEX_LANES="${HAPAX_SUPERVISOR_CODEX_LANES:-cx-cpu-bridge cx-effect-drift-current cx-preset-storm cx-visual-bandwidth}"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Include required Codex lanes in the default sweep

With the new timer service using this default environment, HAPAX_SUPERVISOR_CODEX_LANES is not set anywhere by the unit, so only these four task-named cx-* sessions are supervised. The repo instructions identify cx-red and protected cx-violet as lanes that require screen/session continuity (AGENTS.md lines 10-13), but a dead hapax-codex-cx-red or hapax-codex-cx-violet will never be checked unless an operator separately overrides the env, which defeats the advertised always-on cx coverage.

Useful? React with 👍 / 👎.

EnvironmentFile=-%h/.cache/hapax/claude-headless/%i.env
ExecStart=/usr/bin/env bash -lc 'exec "$HOME/.local/bin/hapax-methodology-dispatch" --lane "%i" --platform claude --mode headless --task "${HAPAX_DISPATCH_TASK:?HAPAX_DISPATCH_TASK required in per-lane env}" --launch'
Restart=no
Restart=always
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid restarting a dispatch path without MQ binding

For the systemd-managed Claude lanes, changing this to Restart=always repeatedly invokes the same ExecStart, but that command only passes --task from HAPAX_DISPATCH_TASK and never provides the durable MQ message id required by scripts/hapax-methodology-dispatch for mutable --launch calls (it blocks with strict_mq_message_id_required at lines 1284-1299 unless HAPAX_METHODOLOGY_DISPATCH_MESSAGE_ID is present). In the normal per-lane env this turns a failed launch into a StartLimit crash loop rather than restoring the lane, so the restart path needs to carry the dispatch message id or avoid the strict launch route.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
tests/scripts/test_lane_supervisor.py (2)

251-259: ⚡ Quick win

Consider adding a live-skip test for antigrav lanes.

The test suite includes test_supervisor_skips_live_codex_lane (line 237) but lacks a parallel test_supervisor_skips_live_antigrav_lane. Since antigrav uses the same tmux-based liveness detection as codex, adding this test would improve coverage symmetry.

📋 Proposed additional test

Add after line 259:

def test_supervisor_skips_live_antigrav_lane(tmp_path: Path) -> None:
    env, calls = _base(
        tmp_path,
        HAPAX_SUPERVISOR_ANTIGRAV_LANES="antigrav",
        TMUX_LIVE="hapax-antigrav-antigrav",
    )
    _make_worktree(env, "antigrav")

    result = _run(env)

    assert result.returncode == 0, result.stderr
    assert _reads(calls, "antigrav.txt") == ""
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/scripts/test_lane_supervisor.py` around lines 251 - 259, Add a new test
mirroring test_supervisor_skips_live_codex_lane named
test_supervisor_skips_live_antigrav_lane: call _base(...) with
HAPAX_SUPERVISOR_ANTIGRAV_LANES="antigrav" and
TMUX_LIVE="hapax-antigrav-antigrav", create the worktree with
_make_worktree(env, "antigrav"), run the supervisor with _run(env), then assert
result.returncode == 0 and assert _reads(calls, "antigrav.txt") == "" to verify
the supervisor skips a live antigrav lane.

275-288: 💤 Low value

Consider asserting on log messages in guardrail tests.

Both test_supervisor_respects_restart_cooldown and test_supervisor_burst_limit_backs_off verify the absence of extra launches but do not assert that the supervisor logged why it skipped (cooldown window vs. burst limit hit). Adding assertions on result.stdout would improve observability and strengthen confidence that the right guardrail triggered.

📋 Example enhancements

In test_supervisor_respects_restart_cooldown:

     # Second pass within cooldown must NOT respawn again.
-    _run(env)
+    result = _run(env)
     assert _reads(calls, "claude.txt") == first
+    assert "cooldown" in result.stdout.lower()

In test_supervisor_burst_limit_backs_off:

     for _ in range(4):
-        _run(env)
+        result = _run(env)
 
     launches = [ln for ln in _reads(calls, "claude.txt").splitlines() if ln.strip()]
     assert len(launches) == 2  # capped at burst limit
+    assert "burst" in result.stdout.lower() or "limit" in result.stdout.lower()

Also applies to: 306-321

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/scripts/test_lane_supervisor.py` around lines 275 - 288, The tests
currently check that no new launch occurred but don't assert the supervisor
logged why; update test_supervisor_respects_restart_cooldown and
test_supervisor_burst_limit_backs_off to capture the result from the second
_run() call (e.g., result = _run(env)) and add an assertion on result.stdout
that the supervisor emitted a clear explanatory message—for the cooldown test
assert the stdout contains keywords like "skipping" and "cooldown" (referencing
test_supervisor_respects_restart_cooldown, _run, and _reads/claude.txt), and for
the burst-limit test assert stdout contains keywords like "skipping" and "burst
limit" (referencing test_supervisor_burst_limit_backs_off and _run) so the
guardrail reason is verifiably logged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/hapax-lane-supervisor`:
- Around line 82-95: The restart-log is allowed to grow unbounded because the
checker (the while read -r ts loop that reads "$logf" and compares timestamps to
cutoff using BURST_WINDOW_S and increments count towards BURST_LIMIT) scans the
whole file each tick while record_restart() only appends; fix by
truncating/rotating the log when recording a restart: update record_restart() to
append the timestamp then trim the file to a bounded size (either remove entries
older than now - BURST_WINDOW_S or keep only the last N lines) so the subsequent
loop only ever reads a small, recent set of timestamps; ensure you still write
"$STATE_DIR/$1.last-restart" and use the same variables now, BURST_WINDOW_S, and
BURST_LIMIT.

In `@tests/scripts/test_lane_supervisor.py`:
- Around line 31-39: The bash snippet in _write_recorder interpolates the Path
log directly into the script which can allow shell injection or break on spaces;
escape or shell-quote the path before embedding it (e.g., use
shlex.quote(str(log))) when calling _write_executable so the redirect target is
safe, and keep the reference to the log target inside the generated script (the
change is local to _write_recorder and affects how the "{log}" token is produced
for _write_executable).

---

Nitpick comments:
In `@tests/scripts/test_lane_supervisor.py`:
- Around line 251-259: Add a new test mirroring
test_supervisor_skips_live_codex_lane named
test_supervisor_skips_live_antigrav_lane: call _base(...) with
HAPAX_SUPERVISOR_ANTIGRAV_LANES="antigrav" and
TMUX_LIVE="hapax-antigrav-antigrav", create the worktree with
_make_worktree(env, "antigrav"), run the supervisor with _run(env), then assert
result.returncode == 0 and assert _reads(calls, "antigrav.txt") == "" to verify
the supervisor skips a live antigrav lane.
- Around line 275-288: The tests currently check that no new launch occurred but
don't assert the supervisor logged why; update
test_supervisor_respects_restart_cooldown and
test_supervisor_burst_limit_backs_off to capture the result from the second
_run() call (e.g., result = _run(env)) and add an assertion on result.stdout
that the supervisor emitted a clear explanatory message—for the cooldown test
assert the stdout contains keywords like "skipping" and "cooldown" (referencing
test_supervisor_respects_restart_cooldown, _run, and _reads/claude.txt), and for
the burst-limit test assert stdout contains keywords like "skipping" and "burst
limit" (referencing test_supervisor_burst_limit_backs_off and _run) so the
guardrail reason is verifiably logged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: e13a022d-e1e7-43d5-9592-555cb4956f9b

📥 Commits

Reviewing files that changed from the base of the PR and between 1dd37ed and 96085b0.

📒 Files selected for processing (8)
  • scripts/hapax-lane-rate-limit-watchdog
  • scripts/hapax-lane-supervisor
  • systemd/units/hapax-claude-lane@.service
  • systemd/units/hapax-lane-supervisor.service
  • systemd/units/hapax-lane-supervisor.timer
  • tests/scripts/test_lane_supervisor.py
  • tests/scripts/test_lane_watchdog_methodology_dispatch.py
  • tests/systemd/test_lane_supervisor_units.py

Comment on lines +82 to +95
local logf="$STATE_DIR/$1.restart-log" cutoff count=0 ts
[ -f "$logf" ] || return 1
cutoff=$(( now - BURST_WINDOW_S ))
while read -r ts; do
[ -z "$ts" ] && continue
[ "$ts" -ge "$cutoff" ] && count=$(( count + 1 ))
done < "$logf"
[ "$count" -ge "$BURST_LIMIT" ]
}

record_restart() {
echo "$now" > "$STATE_DIR/$1.last-restart"
echo "$now" >> "$STATE_DIR/$1.restart-log"
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Bound restart-log growth to keep supervisor runtime stable.

Line 82–89 scans full *.restart-log files each tick, while Line 94 only appends. For repeatedly dead lanes, this becomes unbounded and can eventually threaten the oneshot timeout budget.

💡 Suggested patch
 burst_exhausted() {
-  local logf="$STATE_DIR/$1.restart-log" cutoff count=0 ts
+  local logf="$STATE_DIR/$1.restart-log" cutoff count
   [ -f "$logf" ] || return 1
   cutoff=$(( now - BURST_WINDOW_S ))
-  while read -r ts; do
-    [ -z "$ts" ] && continue
-    [ "$ts" -ge "$cutoff" ] && count=$(( count + 1 ))
-  done < "$logf"
+  count="$(awk -v cutoff="$cutoff" '$1 >= cutoff {c++} END {print c+0}' "$logf" 2>/dev/null || echo 0)"
   [ "$count" -ge "$BURST_LIMIT" ]
 }
 
 record_restart() {
-  echo "$now" > "$STATE_DIR/$1.last-restart"
-  echo "$now" >> "$STATE_DIR/$1.restart-log"
+  local lane="$1" logf="$STATE_DIR/$1.restart-log" cutoff
+  echo "$now" > "$STATE_DIR/$lane.last-restart"
+  cutoff=$(( now - BURST_WINDOW_S ))
+  if [ -f "$logf" ]; then
+    awk -v cutoff="$cutoff" '$1 >= cutoff' "$logf" > "$logf.tmp" 2>/dev/null || true
+    mv "$logf.tmp" "$logf"
+  fi
+  echo "$now" >> "$logf"
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
local logf="$STATE_DIR/$1.restart-log" cutoff count=0 ts
[ -f "$logf" ] || return 1
cutoff=$(( now - BURST_WINDOW_S ))
while read -r ts; do
[ -z "$ts" ] && continue
[ "$ts" -ge "$cutoff" ] && count=$(( count + 1 ))
done < "$logf"
[ "$count" -ge "$BURST_LIMIT" ]
}
record_restart() {
echo "$now" > "$STATE_DIR/$1.last-restart"
echo "$now" >> "$STATE_DIR/$1.restart-log"
}
local logf="$STATE_DIR/$1.restart-log" cutoff count
[ -f "$logf" ] || return 1
cutoff=$(( now - BURST_WINDOW_S ))
count="$(awk -v cutoff="$cutoff" '$1 >= cutoff {c++} END {print c+0}' "$logf" 2>/dev/null || echo 0)"
[ "$count" -ge "$BURST_LIMIT" ]
}
record_restart() {
local lane="$1" logf="$STATE_DIR/$1.restart-log" cutoff
echo "$now" > "$STATE_DIR/$lane.last-restart"
cutoff=$(( now - BURST_WINDOW_S ))
if [ -f "$logf" ]; then
awk -v cutoff="$cutoff" '$1 >= cutoff' "$logf" > "$logf.tmp" 2>/dev/null || true
mv "$logf.tmp" "$logf"
fi
echo "$now" >> "$logf"
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/hapax-lane-supervisor` around lines 82 - 95, The restart-log is
allowed to grow unbounded because the checker (the while read -r ts loop that
reads "$logf" and compares timestamps to cutoff using BURST_WINDOW_S and
increments count towards BURST_LIMIT) scans the whole file each tick while
record_restart() only appends; fix by truncating/rotating the log when recording
a restart: update record_restart() to append the timestamp then trim the file to
a bounded size (either remove entries older than now - BURST_WINDOW_S or keep
only the last N lines) so the subsequent loop only ever reads a small, recent
set of timestamps; ensure you still write "$STATE_DIR/$1.last-restart" and use
the same variables now, BURST_WINDOW_S, and BURST_LIMIT.

Comment on lines +31 to +39
def _write_recorder(path: Path, log: Path) -> None:
"""A fake launcher that records its argv to ``log`` and exits 0."""
_write_executable(
path,
f"""
#!/usr/bin/env bash
printf '%s\\n' "$*" >> "{log}"
""",
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Quote the log path to prevent shell injection.

The log path is directly interpolated into the bash script without proper quoting. If log contains spaces or special characters, the script will break.

🛡️ Proposed fix with proper path quoting
 def _write_recorder(path: Path, log: Path) -> None:
     """A fake launcher that records its argv to ``log`` and exits 0."""
+    # Escape the log path for safe shell interpolation
+    log_escaped = str(log).replace("'", "'\"'\"'")
     _write_executable(
         path,
         f"""
         #!/usr/bin/env bash
-        printf '%s\\n' "$*" >> "{log}"
+        printf '%s\\n' "$*" >> '{log_escaped}'
         """,
     )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/scripts/test_lane_supervisor.py` around lines 31 - 39, The bash snippet
in _write_recorder interpolates the Path log directly into the script which can
allow shell injection or break on spaces; escape or shell-quote the path before
embedding it (e.g., use shlex.quote(str(log))) when calling _write_executable so
the redirect target is safe, and keep the reference to the log target inside the
generated script (the change is local to _write_recorder and affects how the
"{log}" token is produced for _write_executable).

@ryanklee ryanklee added this pull request to the merge queue Jun 1, 2026
Merged via the queue into main with commit a42e777 Jun 1, 2026
38 checks passed
@ryanklee ryanklee deleted the delta/reform-fix-lane-supervisor-20260531 branch June 1, 2026 00:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant