fix(engine): wall-clock deadline + static lint test — run_next cannot hang by jwaldrip · Pull Request #368 · gigsmart/haiku-method

jwaldrip · 2026-05-15T16:30:45Z

Per user goal (2026-05-15): "no matter what circumstance, the run_next call should never get stuck in a loop. It can block via a SPA pop, but it CANNOT hang for infinite recursion or stuck while loops, and you need to have tests and or lint rules that validate any call to run_next cannot hang with near absolute certainty."

Summary

Three structural additions on top of the existing `RUN_NEXT_LOOP_CAP=16`:

Wall-clock deadline (`RUN_NEXT_WALL_DEADLINE_MS=90_000`). Captured once at handler entry. Three structural while-loops (`select_*`, `close_feedback`, `complete_stage`) check it alongside the iter counter. If the whole call exceeds 90s — real disk-walks finish in milliseconds; 90s = wedged — the engine returns `loop_aborted` with the elapsed-ms diagnostic.
`gate_review` loop is deliberately exempt: its body calls `await_gate`, the SPA-pop blocking primitive the user explicitly carved out. A real human-in-the-loop decision can take minutes. The iteration cap + no-progress signature check still bound it; await_gate's own ~120s presence-loss timeout covers "SPA never opened."
NEW static lint test `test/run-next-loops-bounded.test.mjs`. Parses `haiku_run_next.ts` source, finds every `while (...)` opener, asserts each loop body contains an iter-counter-gate OR no-progress sig check OR bounded loop condition. If a future contributor adds a `while` without one of those, the test fails before CI lets the change ship. This is the lint rule the user requested — runtime tests can't enumerate every input that might wedge a loop; the static shape covers all inputs by construction.

Test plan

4 new lint cases pass against current codebase (every existing while is gated).
Full suite: 1627 (was 1623).
MCP boot smoke test passes.

🤖 Generated with Claude Code

…e infinite loops Per the goal "ensure nothing in our engine can put us in an infinite loop, that includes an internal loop to the call itself, or an agent loop that cannot progress." The deadlock detector previously emitted telemetry only — useful for dashboards, useless for prevention. The user could still get stuck in loops because nothing actually stopped the engine from returning the same action again. PR #366 fixed the proximate FB-as-witness loop; this PR fixes the structural floor. What's new: - `wouldDeadlock(slug, action)` — pre-emit predicate on deadlock-detector.ts. Returns a verdict when: * the SAME signature would fire HALT_THRESHOLD (4) consecutive times — the agent has re-ticked through the same payload three times already, the 4th MUST be different state or it's wedged. * the recent window of CHURN_HALT_MIN_TICKS (8) signatures cycles through ≤ CHURN_MAX_DISTINCT (2) values — the A↔B alternation pattern. - `buildLoopHaltAction(slug, verdict)` — produces a concrete `loop_halted` action with the offending signature inline + a recovery script for the agent. - `broadcastTick` (run-tick.ts) now calls wouldDeadlock BEFORE recording or broadcasting. When the verdict fires, it swaps the cursor's action for the halt action. The agent reads `loop_halted`, surfaces the halt to the user, and stops re-ticking. Once the underlying state changes (different signature on the next tick), the halt lifts automatically. - New `prompts/global/loop_halted/` — eta-templated agent directive that's explicit about NOT auto-recovering. Registered in the prompt index. What this guarantees: - The same action signature CANNOT be returned more than 4 consecutive times. Even if every other prevention misses (drift dedup, bolt cap, migration idempotency, witness refresh), the detector forces a halt. - A↔B alternation is bounded at 8 consecutive ticks. - The halt is recoverable: on a different signature, the counter resets. No on-disk state needed; in-memory across the same MCP process. Tests: - deadlock-detector.test.mjs: 5 new cases pinning wouldDeadlock + buildLoopHaltAction — null when no prior, fires on HALT_THRESHOLD-1 + 1, fires on churn, halt action surfaces the signature, fresh intent doesn't false-positive. - Full suite: 1622 passed (was 1617). - Boot smoke test: bundle still launches cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rchitectural # Conflicts: # plugin/bin/haiku.mjs

Three issues from the bot review: 1. Migration error path + select_* gates bypassed broadcastTick. The "every return path funnels through here" claim in the broadcastTick docstring was false — four early returns went around it. The migration error case is the load-bearing one: if migrateIntent throws on every tick, runWorkflowTick would return the same `action: "error"` indefinitely with no halting mechanism. Now all four paths route through broadcastTick → wouldDeadlock fires after HALT_THRESHOLD repeats. 2. loop_halted in the recent window defeated subsequent churn detection for ~CHURN_WINDOW more ticks. recordTickResult now detects the halt marker and skips appending it to recent[]; the count + telemetry still fire so the halt is observed, but the alternation history stays clean and the next churn check doesn't false-negative. 3. Misleading test name — "returns null until HALT_THRESHOLD-1" was actually asserting the predicate FIRES. Renamed to "fires on the HALT_THRESHOLD-th consecutive identical tick". Plus per the user question on Sentry routing + recoverability: - New `haiku.deadlock.halted` telemetry fires from buildLoopHaltAction. Already-existing `haiku.deadlock.suspected` + `churn_suspected` are advisory; the new `halted` event is the hard-halt counterpart. All three flow to the OTLP / Sentry sink via emitTelemetry. - The halt is recoverable by design: when the underlying state changes (a different action signature on the next tick), the consecutive counter resets automatically. No on-disk state to clean up. The agent reads `loop_halted`, surfaces to user, fix the underlying state, next tick lifts the halt. New test covers the loop_halted-not-in-window contract. Tests: 1623 passed (was 1622). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per user goal (2026-05-15): "no matter what circumstance, the run_next call should never get stuck in a loop. It can block via a SPA pop, but it CANNOT hang for infinite recursion or stuck while loops, and you need to have tests and or lint rules that validate any call to run_next cannot hang with near absolute certainty." Three structural additions on top of the existing `RUN_NEXT_LOOP_CAP=16` iteration cap: 1. Wall-clock deadline (`RUN_NEXT_WALL_DEADLINE_MS=90_000`). Captured once at handler entry. Every loop body checks it alongside the iter counter. If the whole call exceeds 90 s (real disk-walks finish in milliseconds; 90 s = wedged), the engine returns `loop_aborted` with the elapsed-ms diagnostic. New `wallDeadlineExceeded` predicate + `wallDeadlineAbortResponse` helper in `_loop_guard.ts`. 2. Wall-deadline applied to the three structural while-loops: `select_*`, `close_feedback`, `complete_stage`. Each loop's first-iteration code now does both `++iterations > RUN_NEXT_LOOP_CAP` AND `wallDeadlineExceeded(startedAt)` checks. 3. The `gate_review` / `user_gate` loop is DELIBERATELY exempt from the wall deadline — its body calls `await_gate`, the SPA- pop blocking primitive the user explicitly carved out. A real human-in-the-loop decision can take minutes; a 90-s deadline would falsely trip. The iteration cap + no-progress sig check bound this loop on the engine's side; await_gate's own presence-loss timeout (~120 s) covers "SPA never opened." 4. NEW STATIC TEST `test/run-next-loops-bounded.test.mjs`. Parses `haiku_run_next.ts` source, finds every `while (...)` opener, asserts each loop body contains an iter-counter-gate (`++X > RUN_NEXT_LOOP_CAP` + `loopAbortResponse`) OR a no-progress signature check OR a bounded loop condition. If a future contributor adds a `while (...)` without one of those, the test fails before CI lets the change ship. This is the "lint rule" the user requested — run-time tests can't enumerate every input that might wedge a loop, but the static shape covers all inputs by construction. All 4 test cases pass against the current codebase. Tests: 1627 (was 1623; +4 new lint cases). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jwaldrip and others added 4 commits May 15, 2026 09:13

Merge remote-tracking branch 'origin/main' into fix/loop-prevention-a…

75a618c

…rchitectural # Conflicts: # plugin/bin/haiku.mjs

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(engine): wall-clock deadline + static lint test — run_next cannot hang#368

fix(engine): wall-clock deadline + static lint test — run_next cannot hang#368
jwaldrip wants to merge 4 commits into
mainfrom
fix/run-next-loops-bounded

jwaldrip commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

1 participant

Conversation

jwaldrip commented May 15, 2026

Summary

Test plan

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

1 participant