fix(cycle): bound propose_takes phase with per-call + phase-level deadlines #2262
Open
tschew72 wants to merge 1 commit into
…dlines

The propose_takes phase was being killed by SIGTERM from outer `timeout 600` wrappers in cron-driven dream runs. Root cause analysis from dream.log (2026-06-15 onward, post-v0.42.20.0 ship): the default `pageLimit=100` combined with the unbounded `gateway.chat` call in `defaultExtractor` made the phase routinely take 50+ minutes on 100 pages, far past the 10-min cron budget.

This commit introduces three layered bounds that together turn a hard-kill SIGTERM into a clean partial result with a `deadline_hit` flag:

1. **EXTRACTOR_CALL_TIMEOUT_MS = 90_000** (per-call AbortSignal.timeout)
   The default gateway timeout (300s) is too generous for short extraction prompts; 90s is already "something is wrong" territory. A stalled provider socket now aborts the single call, the page is logged as a warning, and the phase continues. Mirrors `withDefaultTimeout` in core/ai/gateway.ts.

2. **PHASE_DEADLINE_MS = 30 * 60 * 1000** (phase wall-clock)
   Even with the per-call bound, slow-but-completing responses (rate-limit retries, gateway queueing) can accumulate. 30 min matches patterns.ts and guarantees the phase either completes cleanly or returns a partial result with `deadline_hit: true`. The check is a single `Date.now() - phaseStartMs` comparison inside the page loop: O(1) per page, no scheduler overhead.

3. **pageLimit default 100 → 30**
   100 pages × ~30s/extract = 50 min, which is what was blowing the budget. 30 pages × ~30s = 15 min, which fits inside both the 30-min phase deadline and a $5 budget comfortably (~45K input tokens). Callers that need more (drain mode, off-hours) can opt in via `opts.pageLimit`.

Surfaced via the `propose_takes aborted SIGTERM` pattern that hit the nightly dream cycle from 2026-06-15 to 2026-06-17. After this fix, the phase should return `deadline_hit: true` with whatever proposals it managed to insert before the deadline, instead of being killed mid-loop by the outer wrapper, making dream runs resumable and the lost proposals observable.
ProposeTakesResult gains a new optional `deadline_hit?: boolean` field for callers that want to distinguish "completed" from "deadline-exceeded partial". Older consumers ignore the new field.
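The three bounds described in the commit message could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the constant values and the `deadline_hit` field come from the commit message, while `proposeTakes`, `Extractor`, `Options`, the `warnings` array, the injectable `now` clock, and the `>=` comparison are assumptions made for the example.

```typescript
// Values taken from the commit message; everything else is hypothetical.
const EXTRACTOR_CALL_TIMEOUT_MS = 90_000;
const PHASE_DEADLINE_MS = 30 * 60 * 1000;
const DEFAULT_PAGE_LIMIT = 30;

interface ProposeTakesResult {
  proposals_inserted: number;
  warnings: string[]; // assumed shape for per-page warnings
  deadline_hit?: boolean;
}

// Hypothetical extractor: returns the number of proposals inserted for a page
// and is expected to reject when `signal` aborts.
type Extractor = (page: string, signal: AbortSignal) => Promise<number>;

interface Options {
  pageLimit?: number;
  now?: () => number; // injectable clock so the deadline can be tested
}

async function proposeTakes(
  pages: string[],
  extract: Extractor,
  opts: Options = {},
): Promise<ProposeTakesResult> {
  const now = opts.now ?? Date.now;
  const pageLimit = opts.pageLimit ?? DEFAULT_PAGE_LIMIT;
  const phaseStartMs = now();
  const result: ProposeTakesResult = { proposals_inserted: 0, warnings: [] };

  for (const page of pages.slice(0, pageLimit)) {
    // Phase-level bound: one subtraction and comparison per page.
    if (now() - phaseStartMs >= PHASE_DEADLINE_MS) {
      result.deadline_hit = true;
      break;
    }
    try {
      // Per-call bound: a fresh timeout signal for every extraction.
      const signal = AbortSignal.timeout(EXTRACTOR_CALL_TIMEOUT_MS);
      result.proposals_inserted += await extract(page, signal);
    } catch (err) {
      // A failed or timed-out page becomes a warning; the phase continues.
      const reason = err instanceof Error ? err.message : String(err);
      result.warnings.push(`${page}: ${reason}`);
    }
  }
  return result;
}
```

In this sketch the per-call timeout only helps if the extractor actually passes `signal` through to the underlying request; a call that ignores the signal would still be bounded by the phase deadline, but only at the next page boundary.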
Summary
The `propose_takes` phase was being killed by SIGTERM from outer `timeout 600` wrappers in cron-driven dream runs. Root cause: `pageLimit=100` default combined with the unbounded `gateway.chat` call in `defaultExtractor` made the phase routinely take 50+ minutes.
This PR introduces three layered bounds that turn a hard-kill SIGTERM into a clean partial result:
What changes
Why this matters
Surfaced via the `propose_takes aborted SIGTERM` pattern that hit the nightly dream cycle from 2026-06-15 onward (post-v0.42.20.0 ship, when extraction prompts moved to the production path). After this fix, the phase returns a partial result with whatever proposals it managed to insert before the deadline — making dream runs resumable and the lost proposals observable in the dream log.
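On the consumer side, the optional flag could be read along these lines. This is a hedged sketch only: `classifyOutcome` and `PhaseOutcome` are hypothetical names, and the result type is trimmed to the two fields mentioned in this description.

```typescript
// Trimmed, assumed shape: only the fields named in this PR description.
interface ProposeTakesResult {
  proposals_inserted: number;
  deadline_hit?: boolean;
}

// Hypothetical labels for the two outcomes a caller may want to tell apart.
type PhaseOutcome = "completed" | "deadline_partial";

function classifyOutcome(result: ProposeTakesResult): PhaseOutcome {
  // A missing flag is treated the same as false, so results produced
  // before the field existed still classify as "completed".
  return result.deadline_hit === true ? "deadline_partial" : "completed";
}
```

Because the field is optional, a caller that never reads it keeps working unchanged, and a caller that does read it can decide whether to schedule another run for the remaining pages.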
Test plan
Reproduction (before fix)
```bash
# On a brain with >100 pages, run:
gbrain dream --phase propose_takes
# Watch the dream.log tail: `propose_takes aborted SIGTERM` appears after ~10 min
# (the cron wrapper kills the process before the page loop completes)
```
Fix verification
```bash
# After this PR:
gbrain dream --phase propose_takes
# Phase completes in ~4 min with proposals_inserted: 28
# Or, on a slow LLM day, returns deadline_hit: true with whatever it got
```