
feat(dream): synthesize backfill loop hardening — configured zeros, bigint child ids, orchestrator-owned frontmatter, stderr heartbeats#2284

Open
brettdavies wants to merge 5 commits into
garrytan:masterfrom
brettdavies:feat/dream-synthesize-hardening


Closes #2283

Summary

A hardening of src/core/cycle/synthesize.ts that closes the cluster of failure modes the operator hits during long-running backfills. The PR is one topic, "the dream synthesize backfill loop's reliability under operator-driven multi-day runs," with four behavior changes plus one structural prereq that fall under it:

  • Configured dream.synthesize.cooldown_hours = 0 and min_chars = 0 are honored instead of being silently coerced to defaults. Operators who set these to zero to disable the behavior get the behavior disabled.
  • The orchestrator's Map<number, ChildMeta> lookup is bigint-safe. postgres-js returns INT8 as bigint; the lookup now coerces to Number at every consumer site so it doesn't silently miss.
  • The orchestrator authoritatively stamps deterministic frontmatter (transcript_id, transcript_source, effective_date, date, chunk, dream_cycle_date) onto every dual-written page after each subagent's put_page settles. Subagents still own type / title / tags / body; the orchestrator owns only the fields it computes.
  • The orchestrator emits [dream:sync] N children pending, M active, K completed heartbeats to stderr every ~20s during the wait loop. Operators tailing gbrain dream --phase synthesize --json > /tmp/log now see liveness on stderr while stdout stays clean for the JSON consumer.
  • Structural prereq: DiscoveredTranscript gains a transcriptSource field (inferred from the grandparent dir during discovery, validated against a date-bucket convention) so the orchestrator-frontmatter merge has the field to stamp. A parallel PR (transcript-metadata pack) also ships this commit; whichever PR lands second sees it as a no-op (identical patch) and git auto-resolves.

Before this PR

| Knob / surface | Reported state | Actual behavior |
| --- | --- | --- |
| dream.synthesize.cooldown_hours = 0 | gbrain config get returns 0 | Phase reports 12h cooldown, skips re-runs accordingly |
| dream.synthesize.min_chars = 0 | Same | Same: the default replaces the configured zero |
| Persisted page frontmatter on Postgres | "looks like a real page" | Missing transcript_id, transcript_source, effective_date, date, chunk on every page; only the unconditional fallbacks survive |
| Orchestrator frontmatter merge | Documented to merge deterministic fields | Silently no-ops on Postgres because childMeta.get(bigint) misses (2719n !== 2719); on PGLite it works because the engine returns number |
| Operator visibility during synthesize wait | None | Stderr silent for entire wait period (10+ min on big backfills) |

After this PR

| Knob / surface | Behavior |
| --- | --- |
| dream.synthesize.cooldown_hours = 0 | Disables cooldown; every cycle synthesizes |
| dream.synthesize.min_chars = 0 | Disables the minimum-length skip; every transcript synthesizes |
| Persisted page frontmatter | Deterministic fields stamped by orchestrator, lockstep with disk markdown; subagent drift can't leak |
| Orchestrator frontmatter merge | Works identically on Postgres and PGLite |
| Operator visibility | Heartbeat line on stderr every ~20s during the wait |

Diagnosis (per behavior change)

Configured zeros. The reader uses parseInt(str, 10) || DEFAULT. parseInt('0', 10) || 12 evaluates as 0 || 12 = 12 because 0 is falsy. Replaced with an explicit Number.isFinite check via an IIFE so a configured 0 makes it through. Same pattern at the adjacent min_chars site.
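
To make the falsy-zero trap concrete, here is a minimal sketch of the two reader shapes. The function names and the `DEFAULT_COOLDOWN_HOURS` constant are illustrative assumptions, not the identifiers used in synthesize.ts, and the fixed version is written as a plain function rather than the IIFE the patch uses.

```typescript
// Hypothetical constant; the 12h default is taken from the PR description.
const DEFAULT_COOLDOWN_HOURS = 12;

// Buggy shape: `||` treats a successfully parsed 0 the same as a missing value.
function readCooldownBuggy(raw: string | undefined): number {
  return parseInt(raw ?? '', 10) || DEFAULT_COOLDOWN_HOURS;
}

// Fixed shape: fall back only when parsing did not yield a finite number.
function readCooldownFixed(raw: string | undefined): number {
  const parsed = parseInt(raw ?? '', 10);
  return Number.isFinite(parsed) ? parsed : DEFAULT_COOLDOWN_HOURS;
}

console.log(readCooldownBuggy('0'));       // 12 (configured zero is lost)
console.log(readCooldownFixed('0'));       // 0  (configured zero survives)
console.log(readCooldownFixed(undefined)); // 12 (unset still falls back)
console.log(readCooldownFixed('abc'));     // 12 (malformed still falls back)
```

`??` would not help here on its own: `parseInt` returns `NaN` rather than `null` or `undefined` on bad input, so the explicit finiteness check is what separates "unset or malformed" from "legitimately zero".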

Bigint child ids. The synthesize orchestrator builds childMeta as Map<number, ChildMeta> keyed off child.id (a JS number). At reverse-write time the orchestrator reads subagent_tool_executions.job_id (INT8) back from postgres-js and looks up via childMeta.get(jobId). postgres-js returns INT8 as bigint by default; Map.get uses SameValueZero equality, and 2719n !== 2719, so every lookup silently returns undefined, overrides defaulted to {}, and buildDeterministicFrontmatter(meta, cycleDate) was never reached. Fixed by coercing to Number at every Map.get site (reverseWriteRefs, collectChildPutPageSlugs, the slug-map re-insert).
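
The miss can be reproduced without a database. The sketch below simulates a driver row whose `job_id` arrives as a bigint; the `ChildMeta` shape, the `row` object, and the `toJobKey` helper are assumptions made for illustration (the patch itself is described only as coercing to Number at each lookup site).

```typescript
// Hypothetical, trimmed-down meta shape.
type ChildMeta = { idx: number };

const childMeta = new Map<number, ChildMeta>();
childMeta.set(2719, { idx: 0 }); // keyed by a plain JS number

// Simulated driver row: loosely typed, so the type checker cannot flag the mismatch.
const row = { job_id: BigInt(2719) } as { job_id: any };

// Map compares keys with SameValueZero, which never equates a bigint with a number.
const miss = childMeta.get(row.job_id);

// Normalize at the boundary; refuse ids that cannot be represented exactly.
function toJobKey(id: number | bigint): number {
  const n = Number(id);
  if (!Number.isSafeInteger(n)) {
    throw new RangeError(`job id outside the safe integer range: ${id}`);
  }
  return n;
}

const hit = childMeta.get(toJobKey(row.job_id));

console.log(miss); // undefined
console.log(hit);  // { idx: 0 }
```

The safe-integer guard is an addition of this sketch: a bare `Number(id)` is enough for realistic job ids, but it would silently round an INT8 value above 2^53 - 1.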

Orchestrator-owned frontmatter. Subagents previously owned their full output frontmatter, including fields the operator wants the orchestrator to stamp authoritatively. Subagent drift (typos, omissions, model-side reformatting) leaked into the persisted row and the disk markdown. Fixed by merging the deterministic frontmatter into the DB row via engine.putPage after the subagent's put_page lands AND re-rendering each row to disk markdown via reverseWriteRefs. Both writes carry the same merged fields, so disk and DB stay in lockstep. Subagents still own the content-semantic fields (type, title, tags, body).
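
The ownership split can be pictured as a spread merge in which orchestrator values are applied last. This is a sketch under assumptions: `mergeFrontmatter` and the `DeterministicFields` interface are hypothetical names, only a subset of the stamped fields is shown, and the real merge in synthesize.ts may be structured differently.

```typescript
type Frontmatter = Record<string, unknown>;

// Hypothetical override shape: only fields the orchestrator computes itself.
interface DeterministicFields {
  transcript_id: string;
  transcript_source: string | null;
  effective_date: string;
  date: string;
  dream_cycle_date: string;
}

function mergeFrontmatter(
  fromSubagent: Frontmatter,
  overrides: DeterministicFields,
): Frontmatter {
  // The later spread wins: orchestrator values replace any subagent value for
  // the same key, while keys absent from `overrides` (type, title, tags)
  // pass through untouched.
  return { ...fromSubagent, ...overrides };
}

const merged = mergeFrontmatter(
  { type: 'reflection', title: 'Example', tags: ['a'], effective_date: 'typo' },
  {
    transcript_id: '870565ae-d078-4859-86b6-fc9b896524fe',
    transcript_source: 'claude-code',
    effective_date: '2026-05-25',
    date: '2026-05-25',
    dream_cycle_date: '2026-06-16',
  },
);
console.log(merged.effective_date); // 2026-05-25
console.log(merged.title);          // Example
```

Because the same merged object would feed both the DB write and the disk render, a drifted subagent value for an orchestrator-owned key cannot survive on either side.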

Stderr heartbeats. The wait loop polls the queue at intervals and only logs on state transitions. With --json the stdout pipe is silent (envelope only at completion). Added a ~20s stderr heartbeat [dream:sync] N children pending, M active, K completed so a tmux pane shows liveness without opening the DB.

Reproduction

The cluster reproduces during operator-driven multi-day backfills. Per-failure-mode repros:

  1. gbrain config set dream.synthesize.cooldown_hours 0, run synthesize twice on the same day, observe second run skips with "12h cooldown not elapsed."
  2. Run any synthesize cycle against Postgres, query persisted pages, observe missing transcript_id / transcript_source / effective_date / date / chunk on every page.
  3. Same DB query after the patch shows the fields populated and matching what the orchestrator computed.
  4. gbrain dream --phase synthesize --json > /tmp/log in a tmux pane; stderr silent during wait pre-patch, heartbeats every ~20s post-patch.

Tests

  • test/cycle-synthesize.test.ts: existing wait-loop tests continue to pass; new cases cover the configured-zero reads, the bigint Map.get coercion, and the orchestrator-merge dual-write.
  • bun run typecheck clean. bun run verify green.
  • Verified end-to-end on a production-shaped brain (Postgres 16) running a multi-day backfill loop: every persisted page lands with full deterministic frontmatter; stderr heartbeats fire on every wait window.

Why one PR for four changes

All four live in src/core/cycle/synthesize.ts. All four reproduce together during the same operator-driven backfill the moment any one surfaces. The orchestrator-frontmatter fix is functionally dependent on the bigint coercion (without the coercion, the merge silently no-ops on Postgres). The configured-zeros and stderr-heartbeat fixes are operator-facing and orthogonal in scope, but they ride with the others because the operator that hits the dual-write bug is the same operator running the long backfill with cooldown_hours = 0.

Splitting would create a four-PR chain with an interleaved-merge dependency on shipment order. Bundling them as "the synthesize backfill loop hardening pack" lands the same code with one review surface and one regression test bump.

Adjacent observation (not in this PR)

The synthesize phase currently batches dual-write at end of chunk instead of per-subagent-completion. When the orchestrator dies mid-chunk, every completed subagent's frontmatter merge AND disk write is lost. The per-child flush is a separate architectural change (separate PR), but it would compose cleanly with the orchestrator-owned frontmatter logic in this PR.

Discovery now reports which archive the transcript came from by reading the directory two levels above the file (the parent of the date directory). The canonical claude-code-archive layout is `<corpus>/<source>/<date>/<id>.md`, so a file at `~/.gbrain/transcripts/claude-code/2026-05-19/abc.md` yields `transcriptSource: 'claude-code'`. Returns null when the path doesn't match a `<source>/<date>/` pair, which keeps ad-hoc `--input` runs from inventing a source name.

Source name validation is intentionally strict: lowercase alphanumeric with hyphen separators (matches the existing slug-segment regex). Anything that fails the check returns null rather than a half-baked guess.
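
As an illustration of the rule described above, the following sketch infers the source from a path and returns null on anything that does not match. Both regexes are assumptions: the commit only says the source must be lowercase alphanumeric with hyphen separators and that the parent must be a date directory, so the `YYYY-MM-DD` pattern is inferred from the example path. The function name is hypothetical and only POSIX-style separators are handled.

```typescript
// Assumed slug-segment shape: lowercase alphanumerics joined by single hyphens.
const SOURCE_RE = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;
// Assumed date-bucket shape, based on the example directory `2026-05-19`.
const DATE_RE = /^\d{4}-\d{2}-\d{2}$/;

function inferTranscriptSource(filePath: string): string | null {
  const parts = filePath.split('/').filter((p) => p.length > 0);
  if (parts.length < 3) return null; // need <source>/<date>/<file>

  const dateDir = parts[parts.length - 2];   // parent of the file
  const sourceDir = parts[parts.length - 3]; // grandparent of the file

  // Strict on both halves: no match means no guess.
  if (!DATE_RE.test(dateDir) || !SOURCE_RE.test(sourceDir)) return null;
  return sourceDir;
}

console.log(inferTranscriptSource('~/.gbrain/transcripts/claude-code/2026-05-19/abc.md')); // claude-code
console.log(inferTranscriptSource('/tmp/notes/abc.md'));                                   // null
console.log(inferTranscriptSource('/corpus/Claude_Code/2026-05-19/abc.md'));               // null
```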

Step one of the orchestrator-side-frontmatter work. The downstream synthesize phase will read this to write `transcript_source` into reflection / original frontmatter directly, instead of forcing the Sonnet subagent to invent a source label (which has been drifting: one reflection said `claude-code transcript <uuid>`, another just said `<uuid>` with no archive hint).

Test fixtures updated to include the new required field.
…d values

Both config readers used `parseInt(str, 10) || DEFAULT` to apply a fallback. When the user sets the value to 0, `parseInt('0', 10)` returns 0, the `||` sees a falsy left side, and the fallback wins. Result: a deliberate 0 silently becomes the default — 12 hours for cooldown, 2000 chars for the minimum-transcript-length gate.

Switch both to a NaN check via `Number.isFinite(parsed)`. The fallback fires only on unset / malformed config, not on a legitimate 0.

Hit during a backfill where `gbrain config set dream.synthesize.cooldown_hours 0` was supposed to drop the cooldown so autopilot could process the backlog every tick. The setting persisted and `get` returned 0, but synthesize kept skipping with a "12h cooldown" message because the reader coerced 0 -> 12.

Stop asking the Sonnet subagent to invent frontmatter and have the orchestrator stamp every deterministic field directly. The subagent retains authority over `title` and `tags` because those genuinely need its judgment. Everything else (date, effective_date, transcript_source, transcript_id, transcript_hash, chunk, dream_generated, dream_cycle_date) is computed from the per-child state the orchestrator already tracks for chunk-slug rewriting.

The drift this fixes was unambiguous. Three reflections from one cycle:

```
2026-05-21-verbatim-consistency-over-clever-recasts-558bf0.md   no effective_date, no transcript_hash
2026-06-02-docs-discipline-as-release-surface-f1a913.md         transcript_hash present, no effective_date
2026-06-02-rebase-permission-boundary-97ed24.md                 effective_date present, transcript_hash_suffix instead of transcript_hash
```

Three reasonable choices the subagent could make, three incompatible shapes downstream. Queries that key on `effective_date` or `transcript_hash` silently missed rows.

What changed:

- `chunkInfo` (Map<jobId, {idx, hash6}>) renamed to `childMeta` and populated for every child, not just chunked ones. Carries idx, hash6, chunkTotal, transcriptSource, transcriptId, inferredDate.
- `collectChildPutPageSlugs` now returns `{ slug, source_id, jobId }` so the dual-write can pair each put_page back to the child that produced it.
- New `buildDeterministicFrontmatter(meta, cycleDate)` returns the override map.
- `reverseWriteRefs` now:
  - looks up the meta per slug via job_id,
  - persists the merged frontmatter back to the DB row via `engine.putPage` (so search index, doctor `effective_date_health`, and any consumer that queries frontmatter see the same fields as disk), and
  - threads the overrides into `renderPageToMarkdown` for the disk write.
- `renderPageToMarkdown` gains an `extraOverrides` parameter; the existing identity stamp stays in place and the orchestrator's per-page values land after it.
- `stripContentVersionSuffix` collapses claude-code-archive's `<uuid>--<contentHash>.md` form to the bare session UUID so edited transcripts share a `transcript_id` instead of inflating to one id per edit. Source-of-truth content version stays addressable via `transcript_hash`.
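
For the last bullet, a minimal sketch of the suffix collapse is shown below. It is a guess at the behavior from the description alone, deliberately named `stripSuffixSketch` so it is not mistaken for the PR's `stripContentVersionSuffix`; it assumes the separator is a literal `--` and that the id itself never contains a double hyphen (true for UUIDs).

```typescript
// Collapse `<uuid>--<contentHash>.md` (or `<uuid>.md`) to the bare `<uuid>`.
function stripSuffixSketch(fileName: string): string {
  // Drop the markdown extension if present.
  const base = fileName.endsWith('.md') ? fileName.slice(0, -3) : fileName;
  // Everything from the first `--` onward is treated as the content-version suffix.
  const cut = base.indexOf('--');
  return cut === -1 ? base : base.slice(0, cut);
}

console.log(stripSuffixSketch('870565ae-d078-4859-86b6-fc9b896524fe--0615c3.md'));
// 870565ae-d078-4859-86b6-fc9b896524fe
console.log(stripSuffixSketch('870565ae-d078-4859-86b6-fc9b896524fe.md'));
// 870565ae-d078-4859-86b6-fc9b896524fe
```

Two edits of the same session then map to one `transcript_id`, with the differing content hash carried separately in `transcript_hash`.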

`type` and `title` are still owned by the subagent's put_page because `serializePageToMarkdown` reads them off the Page meta record rather than from the override map. The subagent has been picking those correctly across cycles, so leave that contract alone.

postgres.js returns `subagent_tool_executions.job_id` as bigint, but the queue.add return value (`child.id`) is a plain number when childMeta gets populated. A bigint key and a number key are never equal under Map's SameValueZero comparison, so `Map.set(number)` followed by `Map.get(bigint)` addresses two different keys. Every meta lookup in reverseWriteRefs and collectChildPutPageSlugs silently returned undefined. The result on disk:

```yaml
type: reflection
title: ...
date: '2026-05-25T00:00:00.000Z'
source: claude-code transcript bd876882-d1ed-4275-865a-d5d07e464cc9
effective_date: '2026-05-25T00:00:00.000Z'
dream_generated: true
transcript_hash: 67f1c3
dream_cycle_date: '2026-06-16'
tags: [...]
```

dream_generated and dream_cycle_date show up because renderPageToMarkdown stamps them unconditionally as a fallback when the caller passes an empty overrides map. transcript_id and transcript_source, both stamped only when the meta lookup hits, were always missing.

Coerce r.job_id to Number at both lookup sites so the bigint flows are normalized at the boundary. Run the orchestrator side again on a sample claude-code transcript and the missing fields land:

```yaml
transcript_hash: 0615c3
dream_cycle_date: '2026-06-16'
transcript_id: 870565ae-d078-4859-86b6-fc9b896524fe
transcript_source: claude-code
```

Caught by a tactical `process.stderr.write` on the lookup that showed `jobId=2719 (type=bigint) metaFound=false`. Leaving a note here so the next bigint/number boundary surprise gets caught faster.
…ubagents

When the synthesize phase has dispatched its subagents and entered the
`for (const jobId of childIds)` poll loop, the operator running `gbrain
dream --phase synthesize --from X --to Y --json` saw nothing in the
terminal between `[cycle.synthesize] start` and the final JSON payload.
For a chunk of 50 subagents at concurrency=2, that is 15-25 minutes of
visual silence. A slow chunk and a stuck chunk look identical.

Add four stderr emissions plus a 20-second heartbeat:

1. Pre-loop: orchestrator waiting on N subagent(s)
2. Per-child terminal: subagent M/N status (job=X duration=Ys)
3. Background heartbeat every 20s: queue snapshot scoped to this chunk's
   child IDs (waiting / active / completed / dead / cancelled / failed
   counts), with elapsed time since the wait started
4. Pre-dual-write: all subagents settled; dual-writing N page(s) to disk
5. Post-dual-write: dual-write complete (N disk page(s))

The heartbeat queries minion_jobs scoped to the chunk's childIds so it
does not confuse cross-chunk activity (e.g. autopilot dispatching other
subagents in parallel). Best-effort: errors in the heartbeat are
swallowed so an aggregated SQL failure cannot crash the orchestrator.

Per-child emissions still bunch when several siblings finish during a
long wait on an earlier one (the for loop polls in childIds order, not
completion order). The heartbeat fills the gap so the operator sees
movement every 20 seconds regardless of which sibling is currently
holding up the serial poll.

Stdout stays clean for `--json` consumers because every emission goes
through process.stderr.write.
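
A best-effort heartbeat of this kind can be sketched as a timer whose failures are swallowed. Everything here is an assumption for illustration: `QueueCounts`, `formatHeartbeat`, and `startHeartbeat` are hypothetical names, the line format follows the shorter form quoted in the PR description (the commit's version also reports dead / cancelled / failed counts and elapsed time), and the queue snapshot is injected rather than queried from minion_jobs.

```typescript
type QueueCounts = { pending: number; active: number; completed: number };

function formatHeartbeat(c: QueueCounts): string {
  return `[dream:sync] ${c.pending} children pending, ${c.active} active, ${c.completed} completed\n`;
}

// Returns a stop function; call it once the wait loop has settled.
function startHeartbeat(
  snapshot: () => Promise<QueueCounts>,
  write: (line: string) => void,
  intervalMs: number = 20000,
): () => void {
  const timer = setInterval(() => {
    // Promise.resolve().then(...) also captures synchronous throws from snapshot.
    Promise.resolve()
      .then(snapshot)
      .then((counts) => write(formatHeartbeat(counts)))
      .catch(() => {
        // Best-effort: a failed snapshot or write must never crash the caller.
      });
  }, intervalMs);
  return () => clearInterval(timer);
}
```

In an orchestrator the writer would be something like `(line) => process.stderr.write(line)`, which keeps stdout reserved for the `--json` payload; wrapping the wait loop in `try { ... } finally { stop(); }` ensures the timer does not outlive the chunk.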

Development

Successfully merging this pull request may close these issues.

dream synthesize backfill loop drops configured zeros, bigint job ids, deterministic frontmatter, and operator visibility on long runs
