feat(dream): synthesize backfill loop hardening — configured zeros, bigint child ids, orchestrator-owned frontmatter, stderr heartbeats #2284
Open
brettdavies wants to merge 5 commits into
Discovery now reports which archive the transcript came from by reading the directory two levels above the file (the parent of the date directory). The canonical claude-code-archive layout is `<corpus>/<source>/<date>/<id>.md`, so a file at `~/.gbrain/transcripts/claude-code/2026-05-19/abc.md` yields `transcriptSource: 'claude-code'`. Returns null when the path doesn't match a `<source>/<date>/` pair, which keeps ad-hoc `--input` runs from inventing a source name.

Source name validation is intentionally strict: lowercase alphanumeric with hyphen separators (matches the existing slug-segment regex). Anything that fails the check returns null rather than a half-baked guess.

Step one of the orchestrator-side-frontmatter work. The downstream synthesize phase will read this to write `transcript_source` into reflection / original frontmatter directly, instead of forcing the Sonnet subagent to invent a source label (which has been drifting: one reflection said `claude-code transcript <uuid>`, another just said `<uuid>` with no archive hint).

Test fixtures updated to include the new required field.
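A minimal sketch of the grandparent-directory inference described above. The function name and both regexes are assumptions made for illustration: the commit only says the source check matches "the existing slug-segment regex", which is not shown here, and the `YYYY-MM-DD` date-directory shape is inferred from the example path.

```typescript
// Hypothetical sketch; not the code from this PR.
// Assumed shape of a source segment: lowercase alphanumeric, hyphen-separated.
const SOURCE_SEGMENT_RE = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;
// Assumed shape of a date directory (YYYY-MM-DD).
const DATE_DIR_RE = /^\d{4}-\d{2}-\d{2}$/;

function inferTranscriptSource(filePath: string): string | null {
  // Split on either separator so the sketch needs no path library.
  const parts = filePath.split(/[\\/]/).filter((p) => p.length > 0);
  if (parts.length < 3) return null;

  const dateDir = parts[parts.length - 2] ?? "";   // parent of the file
  const sourceDir = parts[parts.length - 3] ?? ""; // two levels above the file

  // Refuse to guess: both segments must match, otherwise return null.
  if (!DATE_DIR_RE.test(dateDir)) return null;
  if (!SOURCE_SEGMENT_RE.test(sourceDir)) return null;
  return sourceDir;
}
```

Returning `null` for anything outside the `<source>/<date>/` shape keeps the decision at the discovery boundary, so downstream code only needs to distinguish "known source" from "no source".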
…d values
Both config readers used `parseInt(str, 10) || DEFAULT` to apply a fallback. When the user sets the value to 0, `parseInt('0', 10)` returns 0, the `||` sees a falsy left side, and the fallback wins. Result: a deliberate 0 silently becomes the default — 12 hours for cooldown, 2000 chars for the minimum-transcript-length gate.
Switch both to a NaN check via `Number.isFinite(parsed)`. The fallback fires only on unset / malformed config, not on a legitimate 0.
Hit during a backfill where `gbrain config set dream.synthesize.cooldown_hours 0` was supposed to drop the cooldown so autopilot could process the backlog every tick. The setting persisted and `get` returned 0, but synthesize kept skipping with a "12h cooldown" message because the reader coerced 0 -> 12.
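The difference between the two fallback styles can be sketched as follows. The helper names are hypothetical and the real readers in this PR may be shaped differently; only the `parseInt(...) || DEFAULT` pattern and the `Number.isFinite` check come from the commit message.

```typescript
// Hypothetical sketch; not the config readers from this PR.

// Old pattern: a configured "0" parses to 0, which is falsy, so the fallback wins.
function readIntWithOrFallback(raw: string | undefined, fallback: number): number {
  return parseInt(raw ?? "", 10) || fallback;
}

// New pattern: fall back only when parsing produced NaN (unset or malformed input).
function readIntWithFiniteCheck(raw: string | undefined, fallback: number): number {
  const parsed = parseInt(raw ?? "", 10);
  return Number.isFinite(parsed) ? parsed : fallback;
}

const DEFAULT_COOLDOWN_HOURS = 12; // default named in the commit message

console.log(readIntWithOrFallback("0", DEFAULT_COOLDOWN_HOURS));       // 12
console.log(readIntWithFiniteCheck("0", DEFAULT_COOLDOWN_HOURS));      // 0
console.log(readIntWithFiniteCheck(undefined, DEFAULT_COOLDOWN_HOURS)); // 12
console.log(readIntWithFiniteCheck("abc", DEFAULT_COOLDOWN_HOURS));    // 12
```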
Stop asking the Sonnet subagent to invent frontmatter and have the orchestrator stamp every deterministic field directly. The subagent retains authority over `title` and `tags` because those genuinely need its judgment. Everything else — date, effective_date, transcript_source, transcript_id, transcript_hash, chunk, dream_generated, dream_cycle_date — is computed from the per-child state the orchestrator already tracks for chunk-slug rewriting.
The drift this fixes was unambiguous. Three reflections from one cycle:
```
2026-05-21-verbatim-consistency-over-clever-recasts-558bf0.md no effective_date, no transcript_hash
2026-06-02-docs-discipline-as-release-surface-f1a913.md transcript_hash present, no effective_date
2026-06-02-rebase-permission-boundary-97ed24.md effective_date present, transcript_hash_suffix instead of transcript_hash
```
Three reasonable choices the subagent could make, three incompatible shapes downstream. Queries that key on `effective_date` or `transcript_hash` silently missed rows.
What changed:
- `chunkInfo` (Map<jobId, {idx, hash6}>) renamed to `childMeta` and populated for every child, not just chunked ones. Carries idx, hash6, chunkTotal, transcriptSource, transcriptId, inferredDate.
- `collectChildPutPageSlugs` now returns `{ slug, source_id, jobId }` so the dual-write can pair each put_page back to the child that produced it.
- New `buildDeterministicFrontmatter(meta, cycleDate)` returns the override map.
- `reverseWriteRefs` now:
  - looks up the meta per slug via job_id,
  - persists the merged frontmatter back to the DB row via `engine.putPage` (so search index, doctor `effective_date_health`, and any consumer that queries frontmatter see the same fields as disk), and
  - threads the overrides into `renderPageToMarkdown` for the disk write.
- `renderPageToMarkdown` gains an `extraOverrides` parameter; the existing identity stamp stays in place and the orchestrator's per-page values land after it.
- `stripContentVersionSuffix` collapses claude-code-archive's `<uuid>--<contentHash>.md` form to the bare session UUID so edited transcripts share a `transcript_id` instead of inflating to one id per edit. Source-of-truth content version stays addressable via `transcript_hash`.
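As an illustration of the last bullet, a suffix-stripping helper could look like the sketch below. This is not the PR's `stripContentVersionSuffix`; the name and the choice to cut at the first `--` are assumptions based only on the `<uuid>--<contentHash>.md` form described above.

```typescript
// Hypothetical sketch; the real helper may handle more cases.
function stripVersionSuffixSketch(fileName: string): string {
  // Drop the markdown extension if present.
  const base = fileName.endsWith(".md") ? fileName.slice(0, -3) : fileName;
  // A UUID contains single hyphens only, so the first "--" marks the content-hash suffix.
  const cut = base.indexOf("--");
  return cut === -1 ? base : base.slice(0, cut);
}

// Two edits of the same session (illustrative file names) collapse to one id.
const first = stripVersionSuffixSketch("870565ae-d078-4859-86b6-fc9b896524fe--aaaaaa.md");
const second = stripVersionSuffixSketch("870565ae-d078-4859-86b6-fc9b896524fe--bbbbbb.md");
console.log(first === second); // true
```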
`type` and `title` are still owned by the subagent's put_page because `serializePageToMarkdown` reads them off the Page meta record rather than from the override map. The subagent has been picking those correctly across cycles, so leave that contract alone.
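The ownership split can be sketched as an override map merged over whatever the subagent wrote. Everything below is a hypothetical illustration: the interface, the function names, the `chunk` encoding, the zero-based `idx`, and the choice to use `inferredDate` for both `date` and `effective_date` are assumptions, not the PR's `buildDeterministicFrontmatter`.

```typescript
// Hypothetical sketch; field list taken from the commit message, shapes assumed.
interface ChildMetaSketch {
  idx: number;                     // assumed zero-based chunk index
  hash6: string;
  chunkTotal: number;
  transcriptSource: string | null;
  transcriptId: string;
  inferredDate: string;
}

type Frontmatter = Record<string, unknown>;

function buildOverridesSketch(meta: ChildMetaSketch, cycleDate: string): Frontmatter {
  const out: Frontmatter = {
    date: meta.inferredDate,
    effective_date: meta.inferredDate,
    transcript_id: meta.transcriptId,
    transcript_hash: meta.hash6,
    dream_generated: true,
    dream_cycle_date: cycleDate,
  };
  // Only stamp a source when discovery actually found one.
  if (meta.transcriptSource !== null) out["transcript_source"] = meta.transcriptSource;
  // Assumed encoding: "n/total", emitted only for multi-chunk transcripts.
  if (meta.chunkTotal > 1) out["chunk"] = `${meta.idx + 1}/${meta.chunkTotal}`;
  return out;
}

// Overrides land last, so the subagent keeps any key the orchestrator does not set.
function mergeFrontmatterSketch(fromSubagent: Frontmatter, overrides: Frontmatter): Frontmatter {
  return { ...fromSubagent, ...overrides };
}
```

Because the override map never sets `title` or `tags`, the spread order leaves those under the subagent's control while every deterministic key is replaced.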
postgres.js returns `subagent_tool_executions.job_id` as bigint, but the queue.add return value (`child.id`) lands as a plain number when childMeta gets populated. Map.get(bigint) and Map.set(number) are not equivalent: V8 stores them under separate keys. Every meta lookup in reverseWriteRefs and collectChildPutPageSlugs silently returned undefined.

The result on disk:

```yaml
type: reflection
title: ...
date: '2026-05-25T00:00:00.000Z'
source: claude-code transcript bd876882-d1ed-4275-865a-d5d07e464cc9
effective_date: '2026-05-25T00:00:00.000Z'
dream_generated: true
transcript_hash: 67f1c3
dream_cycle_date: '2026-06-16'
tags: [...]
```

dream_generated and dream_cycle_date show up because renderPageToMarkdown stamps them unconditionally as a fallback when the caller passes an empty overrides map. transcript_id and transcript_source, both stamped only when the meta lookup hits, were always missing.

Coerce r.job_id to Number at both lookup sites so the bigint flows are normalized at the boundary. Run the orchestrator side again on a sample claude-code transcript and the missing fields land:

```yaml
transcript_hash: 0615c3
dream_cycle_date: '2026-06-16'
transcript_id: 870565ae-d078-4859-86b6-fc9b896524fe
transcript_source: claude-code
```

Caught by a tactical `process.stderr.write` on the lookup that showed `jobId=2719 (type=bigint) metaFound=false`. Leaving a note here so the next bigint/number boundary surprise gets caught faster.
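The key mismatch is easy to reproduce in isolation. The sketch below simulates a driver row whose `job_id` is a bigint at runtime even though the declared type is `number`; the row shape, the `toJobKey` helper, and its safe-integer guard are assumptions added for illustration, not code from this PR.

```typescript
// Hypothetical sketch of the bigint/number Map key mismatch.
interface ChildMetaEntry {
  transcriptId: string;
}

const childMeta = new Map<number, ChildMetaEntry>();
childMeta.set(2719, { transcriptId: "example-session" }); // keyed by a plain number

// Simulated driver row: typed as number, but a bigint at runtime.
const row = { job_id: BigInt(2719) as unknown as number };

// Map keys are compared with SameValueZero, so the bigint 2719n never matches the number 2719.
const missed = childMeta.get(row.job_id);
console.log(missed); // undefined

// Normalize at the boundary. The safe-integer guard is an extra precaution assumed here.
function toJobKey(id: number | bigint): number {
  const key = Number(id);
  if (!Number.isSafeInteger(key)) {
    throw new RangeError(`job id ${String(id)} does not fit in a safe integer`);
  }
  return key;
}

const found = childMeta.get(toJobKey(row.job_id));
console.log(found?.transcriptId); // "example-session"
```

The type system cannot catch this on its own when the row is typed as `number`, which is why coercing at every read site is the defensive choice.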
…ubagents

When the synthesize phase has dispatched its subagents and entered the `for (const jobId of childIds)` poll loop, the operator running `gbrain dream --phase synthesize --from X --to Y --json` saw nothing in the terminal between `[cycle.synthesize] start` and the final JSON payload. For a chunk of 50 subagents at concurrency=2, that is 15-25 minutes of visual silence. A slow chunk and a stuck chunk look identical.

Add four stderr emissions plus a 20-second heartbeat:

1. Pre-loop: orchestrator waiting on N subagent(s)
2. Per-child terminal: subagent M/N status (job=X duration=Ys)
3. Background heartbeat every 20s: queue snapshot scoped to this chunk's child IDs (waiting / active / completed / dead / cancelled / failed counts), with elapsed time since the wait started
4. Pre-dual-write: all subagents settled; dual-writing N page(s) to disk
5. Post-dual-write: dual-write complete (N disk page(s))

The heartbeat queries minion_jobs scoped to the chunk's childIds so it does not confuse cross-chunk activity (e.g. autopilot dispatching other subagents in parallel). Best-effort: errors in the heartbeat are swallowed so an aggregated SQL failure cannot crash the orchestrator.

Per-child emissions still bunch when several siblings finish during a long wait on an earlier one (the for loop polls in childIds order, not completion order). The heartbeat fills the gap so the operator sees movement every 20 seconds regardless of which sibling is currently holding up the serial poll.

Stdout stays clean for `--json` consumers because every emission goes through process.stderr.write.
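A best-effort heartbeat of this kind can be sketched as below. All names, the line format, and the injected `snapshot` / `write` callbacks are assumptions for illustration; in the orchestrator the snapshot would be the scoped `minion_jobs` query and the sink would be `process.stderr.write`.

```typescript
// Hypothetical sketch; not the heartbeat code from this PR.
type QueueCounts = Record<
  "waiting" | "active" | "completed" | "dead" | "cancelled" | "failed",
  number
>;

// Pure formatter so the line shape can be checked without timers.
function formatHeartbeat(c: QueueCounts, elapsedMs: number): string {
  const secs = Math.round(elapsedMs / 1000);
  return (
    `[heartbeat] waiting=${c.waiting} active=${c.active} completed=${c.completed} ` +
    `dead=${c.dead} cancelled=${c.cancelled} failed=${c.failed} elapsed=${secs}s\n`
  );
}

// Starts the interval and returns a stop function for the caller to invoke after the wait loop.
function startHeartbeat(
  snapshot: () => Promise<QueueCounts>,
  write: (line: string) => void,
  intervalMs: number = 20000,
): () => void {
  const startedAt = Date.now();
  const timer = setInterval(() => {
    // Promise.resolve().then(...) also captures synchronous throws from snapshot.
    Promise.resolve()
      .then(snapshot)
      .then((counts) => write(formatHeartbeat(counts, Date.now() - startedAt)))
      .catch(() => {
        // Best-effort: a failed snapshot must never crash the orchestrator.
      });
  }, intervalMs);
  return () => clearInterval(timer);
}
```

A caller would pass something like `(line) => process.stderr.write(line)` as the sink and call the returned stop function in a `finally` block, so the interval cannot outlive the wait loop.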
Closes #2283
Summary
A hardening of `src/core/cycle/synthesize.ts` that closes the cluster of failure modes the operator hits during long-running backfills. The PR is one topic, "the dream synthesize backfill loop's reliability under operator-driven multi-day runs," with four behavior changes plus one structural prereq that fall under it:

- `dream.synthesize.cooldown_hours = 0` and `min_chars = 0` are honored instead of being silently coerced to defaults. Operators who set these to zero to disable the behavior get the behavior disabled.
- `Map<number, ChildMeta>` lookup is bigint-safe. `postgres-js` returns `INT8` as `bigint`; the lookup now coerces to `Number` at every consumer site so it doesn't silently miss.
- The orchestrator merges deterministic frontmatter (`transcript_id`, `transcript_source`, `effective_date`, `date`, `chunk`, `dream_cycle_date`) onto every dual-written page after each subagent's `put_page` settles. Subagents still own `type` / `title` / `tags` / body; the orchestrator owns only the fields it computes.
- The orchestrator emits `[dream:sync] N children pending, M active, K completed` heartbeats to stderr every ~20s during the wait loop. Operators tailing `gbrain dream --phase synthesize --json > /tmp/log` now see liveness on stderr while stdout stays clean for the JSON consumer.
- `DiscoveredTranscript` gains a `transcriptSource` field (inferred from the grandparent dir during discovery, validated against a date-bucket convention) so the orchestrator-frontmatter merge has the field to stamp. A parallel PR (transcript-metadata pack) also ships this commit; whichever PR lands second sees it as a no-op (identical patch) and git auto-resolves.

Before this PR

- `dream.synthesize.cooldown_hours = 0`: `gbrain config get` returns `0`, but synthesize still reports a `12h cooldown` and skips re-runs accordingly.
- `dream.synthesize.min_chars = 0`: silently coerced to the default.
- Frontmatter: `transcript_id`, `transcript_source`, `effective_date`, `date`, `chunk` missing on every page; only the unconditional fallbacks survive.
- Child-meta lookup on Postgres: `childMeta.get(bigint)` misses (`2719n !== 2719`); on PGLite it works because the engine returns `number`.

After this PR

- `dream.synthesize.cooldown_hours = 0`: honored.
- `dream.synthesize.min_chars = 0`: honored.

Diagnosis (per behavior change)
Configured zeros. The reader uses `parseInt(str, 10) || DEFAULT`. `parseInt('0', 10) || 12` evaluates as `0 || 12 = 12` because `0` is falsy. Replaced with an explicit `Number.isFinite` check via an IIFE so a configured `0` makes it through. Same pattern at the adjacent `min_chars` site.

Bigint child ids. The synthesize orchestrator builds `childMeta` as `Map<number, ChildMeta>` keyed off `child.id` (a JS `number`). At reverse-write time the orchestrator reads `subagent_tool_executions.job_id` (INT8) back from postgres-js and looks up via `childMeta.get(jobId)`. postgres-js returns INT8 as `bigint` by default; `Map.get` uses SameValueZero equality, and `2719n !== 2719`, so every lookup silently returned `undefined`, `overrides` defaulted to `{}`, and `buildDeterministicFrontmatter(meta, cycleDate)` was never reached. Fixed by coercing to `Number` at every `Map.get` site (`reverseWriteRefs`, `collectChildPutPageSlugs`, the slug-map re-insert).

Orchestrator-owned frontmatter. Subagents previously owned their full output frontmatter, including fields the operator wants the orchestrator to stamp authoritatively. Subagent drift (typos, omissions, model-side reformatting) leaked into the persisted row and the disk markdown. Fixed by re-rendering each row to disk markdown via `reverseWriteRefs` and merging the deterministic frontmatter into the DB row via `engine.putPage` after the subagent's `put_page` lands. The same fields are written on both sides, so disk and DB stay in lockstep. Subagents still own the content-semantic fields.

Stderr heartbeats. The wait loop polls the queue at intervals and only logs on state transitions. With `--json` the stdout pipe is silent (envelope only at completion). Added a ~20s stderr heartbeat `[dream:sync] N children pending, M active, K completed` so a tmux pane shows liveness without opening the DB.

Reproduction
Cluster reproduces during operator-driven multi-day backfills. Per-failure-mode repros:

- `gbrain config set dream.synthesize.cooldown_hours 0`, run synthesize twice on the same day, observe second run skips with "12h cooldown not elapsed."
- Run synthesize on Postgres; pre-patch, `transcript_id` / `transcript_source` / `effective_date` / `date` / `chunk` are missing on every page.
- Run `gbrain dream --phase synthesize --json > /tmp/log` in a tmux pane; stderr silent during wait pre-patch, heartbeats every ~20s post-patch.

Tests

`test/cycle-synthesize.test.ts`: existing wait-loop tests continue to pass; new cases cover the configured-zero reads, the bigint `Map.get` coercion, and the orchestrator-merge dual-write. `bun run typecheck` clean. `bun run verify` green.

Why one PR for four changes
All four live in `src/core/cycle/synthesize.ts`. All four reproduce together during the same operator-driven backfill the moment any one surfaces. The orchestrator-frontmatter fix is functionally dependent on the bigint coercion (without the coercion, the merge silently no-ops on Postgres). The configured-zeros and stderr-heartbeat fixes are operator-facing and orthogonal in scope, but they ride with the others because the operator that hits the dual-write bug is the same operator running the long backfill with `cooldown_hours = 0`.

Splitting would create a four-PR chain with an interleaved-merge dependency on shipment order. Bundling them as "the synthesize backfill loop hardening pack" lands the same code with one review surface and one regression test bump.
Adjacent observation (not in this PR)
The synthesize phase currently batches dual-write at end of chunk instead of per-subagent-completion. When the orchestrator dies mid-chunk, every completed subagent's frontmatter merge AND disk write is lost. The per-child flush is a separate architectural change (separate PR), but it would compose cleanly with the orchestrator-owned frontmatter logic in this PR.