fix(cycle): bound propose_takes phase with per-call + phase-level deadlines #2262

Open
tschew72 wants to merge 1 commit into garrytan:master from tschew72:fix/propose-takes-deadline-and-timeout

tschew72 commented Jun 18, 2026

Summary

The `propose_takes` phase was being killed by SIGTERM from outer `timeout 600` wrappers in cron-driven dream runs. Root cause: the default `pageLimit=100`, combined with the unbounded `gateway.chat` call in `defaultExtractor`, made the phase routinely take 50+ minutes, far past the 10-minute cron budget.

This PR introduces three layered bounds that turn a hard-kill SIGTERM into a clean partial result:

  1. `EXTRACTOR_CALL_TIMEOUT_MS = 90_000` — per-call `AbortSignal.timeout` on the extractor. A stalled provider socket aborts the single call; the page is logged as a warning; the phase continues.
  2. `PHASE_DEADLINE_MS = 30 * 60 * 1000` — phase wall-clock deadline checked inside the page loop. Phase returns `deadline_hit: true` instead of being killed by an outer wrapper.
  3. `pageLimit` default lowered from 100 to 30: 30 pages × ~30s = 15 min, which fits inside the 30-min phase deadline and a $5 budget.
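
How the first two bounds interact can be sketched as below. This is a minimal illustration, not the code in `src/core/cycle/propose-takes.ts`: only the two constants and the `deadline_hit` flag come from this PR, while `runBoundedLoop`, the `Extractor` signature, `LoopResult`, and the injectable `now` clock are hypothetical names introduced for the example. It also assumes the extractor honors the `AbortSignal` it is given.

```typescript
// Constants quoted from the PR description.
const EXTRACTOR_CALL_TIMEOUT_MS = 90_000;
const PHASE_DEADLINE_MS = 30 * 60 * 1000;

// Assumption: the extractor accepts an AbortSignal and rejects when it fires.
type Extractor = (page: string, signal: AbortSignal) => Promise<string[]>;

// Hypothetical result shape; only `deadline_hit` is named in the PR.
interface LoopResult {
  proposals: string[];
  warnings: string[];
  deadline_hit?: boolean;
}

async function runBoundedLoop(
  pages: string[],
  extract: Extractor,
  opts: { callTimeoutMs?: number; phaseDeadlineMs?: number; now?: () => number } = {},
): Promise<LoopResult> {
  const callTimeoutMs = opts.callTimeoutMs ?? EXTRACTOR_CALL_TIMEOUT_MS;
  const phaseDeadlineMs = opts.phaseDeadlineMs ?? PHASE_DEADLINE_MS;
  const now = opts.now ?? Date.now;
  const phaseStartMs = now();
  const result: LoopResult = { proposals: [], warnings: [] };

  for (const page of pages) {
    // Phase-level bound: one subtraction and comparison per page.
    if (now() - phaseStartMs >= phaseDeadlineMs) {
      result.deadline_hit = true;
      break;
    }
    try {
      // Per-call bound: a fresh timeout signal for each extraction.
      const takes = await extract(page, AbortSignal.timeout(callTimeoutMs));
      result.proposals.push(...takes);
    } catch (err) {
      // A stalled or failed call costs one page, not the whole phase.
      result.warnings.push(`extract failed for ${page}: ${String(err)}`);
    }
  }
  return result;
}
```

Because the deadline is only checked between pages, a call already in flight can overrun it by up to the per-call timeout, so in this sketch the worst-case phase duration is roughly the phase deadline plus one call timeout.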

What changes

  • `src/core/cycle/propose-takes.ts` — 53 insertions, 1 deletion
  • `ProposeTakesResult` gains optional `deadline_hit?: boolean` field (older consumers ignore it)
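
A consumer can tell a complete run from a deadline-limited partial one roughly as follows. This is a sketch under assumptions: `deadline_hit` is the field this PR adds, `proposals_inserted` is borrowed from the verification output quoted further down and may not be the real field name, and `describeResult` is a hypothetical helper.

```typescript
// Assumed subset of ProposeTakesResult; the real interface has other fields.
interface ProposeTakesResult {
  proposals_inserted: number; // assumption: name taken from the verification output
  deadline_hit?: boolean;     // optional, so results from older code still type-check
}

// Hypothetical helper: summarize a result for the dream log.
function describeResult(r: ProposeTakesResult): string {
  // An absent field and an explicit `false` both mean the phase ran to completion.
  return r.deadline_hit === true
    ? `partial: ${r.proposals_inserted} proposals inserted before the deadline`
    : `complete: ${r.proposals_inserted} proposals inserted`;
}
```

Keeping the field optional is what lets older consumers ignore it: they never read `deadline_hit`, and results that omit it are treated as complete.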

Why this matters

Surfaced via the `propose_takes aborted SIGTERM` pattern that hit the nightly dream cycle from 2026-06-15 onward (post-v0.42.20.0 ship, when extraction prompts moved to the production path). After this fix, the phase returns a partial result with whatever proposals it managed to insert before the deadline — making dream runs resumable and the lost proposals observable in the dream log.

Test plan

  • Live test on our 219-page brain: phase completed cleanly in ~4 min with `pageLimit=30`
  • No `propose_takes aborted SIGTERM` in dream.log since 2026-06-18
  • Brain score 83/100 preserved through the change
  • Need upstream CI to confirm on the larger hermetic test set

Reproduction (before fix)

```bash
# On a brain with >100 pages, run:
gbrain dream --phase propose_takes
# Watch the tail of dream.log: "propose_takes aborted SIGTERM" appears after ~10 min
# (the cron wrapper kills the process before the page loop completes)
```

Fix verification

```bash
# After this PR:
gbrain dream --phase propose_takes
# Phase completes in ~4 min with proposals_inserted: 28
# Or, on a slow LLM day, returns deadline_hit: true with whatever it got
```

…dlines

The propose_takes phase was being killed by SIGTERM from outer `timeout 600`
wrappers in cron-driven dream runs. Root cause analysis from dream.log
(2026-06-15 onward, post-v0.42.20.0 ship): the default `pageLimit=100`
combined with the unbounded `gateway.chat` call in `defaultExtractor` made
the phase routinely take 50+ minutes on 100 pages, far past the 10-min
cron budget.

This commit introduces three layered bounds that together turn a
hard-kill SIGTERM into a clean partial result with a `deadline_hit` flag:

1. **EXTRACTOR_CALL_TIMEOUT_MS = 90_000** (per-call AbortSignal.timeout)
   The default gateway timeout (300s) is too generous for short extraction
   prompts — 90s is "something is wrong" territory. A stalled provider
   socket now aborts the single call, the page is logged as a warning,
   and the phase continues. Mirrors `withDefaultTimeout` in core/ai/gateway.ts.

2. **PHASE_DEADLINE_MS = 30 * 60 * 1000** (phase wall-clock)
   Even with the per-call bound, slow-but-completing responses
   (rate-limit retries, gateway queueing) can accumulate. 30 min matches
   patterns.ts and guarantees the phase either completes cleanly or
   returns a partial result with `deadline_hit: true`. The check is a
   single `Date.now() - phaseStartMs` comparison inside the page loop:
   O(1) per page, no scheduler overhead.

3. **pageLimit default 100 → 30**
   100 pages × ~30s/extract = 50 min, which is what was blowing the
   budget. 30 pages × ~30s = 15 min, which fits comfortably inside both
   the 30-min phase deadline and a $5 budget (~45K input tokens).
   Callers that need more (drain mode, off-hours) can opt in via
   `opts.pageLimit`.
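
The sizing arithmetic in item 3 can be checked with a small helper. The function names are hypothetical, and the ~30s per extraction is the rough figure quoted above, not a measured constant.

```typescript
// Estimated phase duration in minutes for a given page limit.
function estimatePhaseMinutes(pageLimit: number, secondsPerExtract: number): number {
  return (pageLimit * secondsPerExtract) / 60;
}

// True when the estimate fits inside the given time budget.
function fitsBudget(pageLimit: number, secondsPerExtract: number, budgetMinutes: number): boolean {
  return estimatePhaseMinutes(pageLimit, secondsPerExtract) <= budgetMinutes;
}

const PHASE_DEADLINE_MINUTES = 30; // PHASE_DEADLINE_MS expressed in minutes

const oldDefaultFits = fitsBudget(100, 30, PHASE_DEADLINE_MINUTES); // 50 min estimate
const newDefaultFits = fitsBudget(30, 30, PHASE_DEADLINE_MINUTES);  // 15 min estimate
```

A caller opting into a larger `opts.pageLimit` could run the same estimate against its own time budget before starting the phase.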

Surfaced via the `propose_takes aborted SIGTERM` pattern that hit the
nightly dream cycle from 2026-06-15 to 2026-06-17. After this fix, the
phase should return `deadline_hit: true` with whatever proposals it
managed to insert before the deadline, instead of being killed mid-loop
by the outer wrapper — making dream runs resumable and the lost
proposals observable.

ProposeTakesResult gains a new optional `deadline_hit?: boolean` field
for callers that want to distinguish "completed" from "deadline-exceeded
partial". Older consumers ignore the new field.