Skip to content

fix(engine): wall-clock deadline + static lint test — run_next cannot hang#368

Open
jwaldrip wants to merge 4 commits into
mainfrom
fix/run-next-loops-bounded
Open

fix(engine): wall-clock deadline + static lint test — run_next cannot hang#368
jwaldrip wants to merge 4 commits into
mainfrom
fix/run-next-loops-bounded

Conversation

@jwaldrip
Copy link
Copy Markdown
Member

Per user goal (2026-05-15): "no matter what circumstance, the run_next call should never get stuck in a loop. It can block via a SPA pop, but it CANNOT hang for infinite recursion or stuck while loops, and you need to have tests and or lint rules that validate any call to run_next cannot hang with near absolute certainty."

Summary

Three structural additions on top of the existing `RUN_NEXT_LOOP_CAP=16`:

  1. Wall-clock deadline (`RUN_NEXT_WALL_DEADLINE_MS=90_000`). Captured once at handler entry. Three structural while-loops (`select_*`, `close_feedback`, `complete_stage`) check it alongside the iter counter. If the whole call exceeds 90s — real disk-walks finish in milliseconds; 90s = wedged — the engine returns `loop_aborted` with the elapsed-ms diagnostic.
  2. `gate_review` loop is deliberately exempt: its body calls `await_gate`, the SPA-pop blocking primitive the user explicitly carved out. A real human-in-the-loop decision can take minutes. The iteration cap + no-progress signature check still bound it; await_gate's own ~120s presence-loss timeout covers "SPA never opened."
  3. NEW static lint test `test/run-next-loops-bounded.test.mjs`. Parses `haiku_run_next.ts` source, finds every `while (...)` opener, asserts each loop body contains an iter-counter-gate OR no-progress sig check OR bounded loop condition. If a future contributor adds a `while` without one of those, the test fails before CI lets the change ship. This is the lint rule the user requested — runtime tests can't enumerate every input that might wedge a loop; the static shape covers all inputs by construction.

Test plan

  • 4 new lint cases pass against current codebase (every existing while is gated).
  • Full suite: 1627 (was 1623).
  • MCP boot smoke test passes.

🤖 Generated with Claude Code

jwaldrip and others added 4 commits May 15, 2026 09:13
…e infinite loops

Per the goal "ensure nothing in our engine can put us in an infinite
loop, that includes an internal loop to the call itself, or an agent
loop that cannot progress."

The deadlock detector previously emitted telemetry only — useful for
dashboards, useless for prevention. The user could still get stuck in
loops because nothing actually stopped the engine from returning the
same action again. PR #366 fixed the proximate FB-as-witness loop;
this PR fixes the structural floor.

What's new:
- `wouldDeadlock(slug, action)` — pre-emit predicate on
  deadlock-detector.ts. Returns a verdict when:
    * the SAME signature would fire HALT_THRESHOLD (4) consecutive
      times — the agent has re-ticked through the same payload three
      times already, the 4th MUST be different state or it's wedged.
    * the recent window of CHURN_HALT_MIN_TICKS (8) signatures cycles
      through ≤ CHURN_MAX_DISTINCT (2) values — the A↔B alternation
      pattern.
- `buildLoopHaltAction(slug, verdict)` — produces a concrete
  `loop_halted` action with the offending signature inline + a
  recovery script for the agent.
- `broadcastTick` (run-tick.ts) now calls wouldDeadlock BEFORE
  recording or broadcasting. When the verdict fires, it swaps the
  cursor's action for the halt action. The agent reads
  `loop_halted`, surfaces the halt to the user, and stops re-ticking.
  Once the underlying state changes (different signature on the next
  tick), the halt lifts automatically.
- New `prompts/global/loop_halted/` — eta-templated agent directive
  that's explicit about NOT auto-recovering. Registered in the
  prompt index.

What this guarantees:
- The same action signature CANNOT be returned more than 4
  consecutive times. Even if every other prevention misses (drift
  dedup, bolt cap, migration idempotency, witness refresh), the
  detector forces a halt.
- A↔B alternation is bounded at 8 consecutive ticks.
- The halt is recoverable: on a different signature, the counter
  resets. No on-disk state needed; in-memory across the same MCP
  process.

Tests:
- deadlock-detector.test.mjs: 5 new cases pinning wouldDeadlock +
  buildLoopHaltAction — null when no prior, fires on
  HALT_THRESHOLD-1 + 1, fires on churn, halt action surfaces the
  signature, fresh intent doesn't false-positive.
- Full suite: 1622 passed (was 1617).
- Boot smoke test: bundle still launches cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rchitectural

# Conflicts:
#	plugin/bin/haiku.mjs
Three issues from the bot review:

1. Migration error path + select_* gates bypassed broadcastTick.
   The "every return path funnels through here" claim in the
   broadcastTick docstring was false — four early returns went
   around it. The migration error case is the load-bearing one:
   if migrateIntent throws on every tick, runWorkflowTick would
   return the same `action: "error"` indefinitely with no halting
   mechanism. Now all four paths route through broadcastTick →
   wouldDeadlock fires after HALT_THRESHOLD repeats.

2. loop_halted in the recent window defeated subsequent churn
   detection for ~CHURN_WINDOW more ticks. recordTickResult now
   detects the halt marker and skips appending it to recent[];
   the count + telemetry still fire so the halt is observed, but
   the alternation history stays clean and the next churn check
   doesn't false-negative.

3. Misleading test name — "returns null until HALT_THRESHOLD-1"
   was actually asserting the predicate FIRES. Renamed to "fires
   on the HALT_THRESHOLD-th consecutive identical tick".

Plus per the user question on Sentry routing + recoverability:
- New `haiku.deadlock.halted` telemetry fires from
  buildLoopHaltAction. Already-existing `haiku.deadlock.suspected`
  + `churn_suspected` are advisory; the new `halted` event is the
  hard-halt counterpart. All three flow to the OTLP / Sentry sink
  via emitTelemetry.
- The halt is recoverable by design: when the underlying state
  changes (a different action signature on the next tick), the
  consecutive counter resets automatically. No on-disk state to
  clean up. The agent reads `loop_halted`, surfaces to user, fix
  the underlying state, next tick lifts the halt.

New test covers the loop_halted-not-in-window contract.
Tests: 1623 passed (was 1622).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per user goal (2026-05-15): "no matter what circumstance, the
run_next call should never get stuck in a loop. It can block via a
SPA pop, but it CANNOT hang for infinite recursion or stuck while
loops, and you need to have tests and or lint rules that validate
any call to run_next cannot hang with near absolute certainty."

Three structural additions on top of the existing
`RUN_NEXT_LOOP_CAP=16` iteration cap:

1. Wall-clock deadline (`RUN_NEXT_WALL_DEADLINE_MS=90_000`).
   Captured once at handler entry. Every loop body checks it
   alongside the iter counter. If the whole call exceeds 90 s
   (real disk-walks finish in milliseconds; 90 s = wedged), the
   engine returns `loop_aborted` with the elapsed-ms diagnostic.
   New `wallDeadlineExceeded` predicate + `wallDeadlineAbortResponse`
   helper in `_loop_guard.ts`.

2. Wall-deadline applied to the three structural while-loops:
   `select_*`, `close_feedback`, `complete_stage`. Each loop's
   first-iteration code now does both `++iterations > RUN_NEXT_LOOP_CAP`
   AND `wallDeadlineExceeded(startedAt)` checks.

3. The `gate_review` / `user_gate` loop is DELIBERATELY exempt
   from the wall deadline — its body calls `await_gate`, the SPA-
   pop blocking primitive the user explicitly carved out. A real
   human-in-the-loop decision can take minutes; a 90-s deadline
   would falsely trip. The iteration cap + no-progress sig check
   bound this loop on the engine's side; await_gate's own
   presence-loss timeout (~120 s) covers "SPA never opened."

4. NEW STATIC TEST `test/run-next-loops-bounded.test.mjs`. Parses
   `haiku_run_next.ts` source, finds every `while (...)` opener,
   asserts each loop body contains an iter-counter-gate
   (`++X > RUN_NEXT_LOOP_CAP` + `loopAbortResponse`) OR a
   no-progress signature check OR a bounded loop condition. If a
   future contributor adds a `while (...)` without one of those,
   the test fails before CI lets the change ship. This is the
   "lint rule" the user requested — run-time tests can't enumerate
   every input that might wedge a loop, but the static shape
   covers all inputs by construction.

All 4 test cases pass against the current codebase. Tests: 1627
(was 1623; +4 new lint cases).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

1 participant