feat(dream): orchestrator-owned transcript metadata + transcriptSource discovery field #2286
Open
brettdavies wants to merge 2 commits into
Discovery now reports which archive the transcript came from by reading the directory two levels above the file (the parent of the date directory). The canonical claude-code-archive layout is `<corpus>/<source>/<date>/<id>.md`, so a file at `~/.gbrain/transcripts/claude-code/2026-05-19/abc.md` yields `transcriptSource: 'claude-code'`. Discovery returns null when the path doesn't match a `<source>/<date>/` pair, which keeps ad-hoc `--input` runs from inventing a source name.
Source name validation is intentionally strict: lowercase alphanumeric with hyphen separators (matching the existing slug-segment regex). Anything that fails the check returns null rather than a half-baked guess.
This is step one of the orchestrator-side-frontmatter work. The downstream synthesize phase will read this field to write `transcript_source` into reflection / original frontmatter directly, instead of forcing the Sonnet subagent to invent a source label (which has been drifting: one reflection said `claude-code transcript <uuid>`, another just said `<uuid>` with no archive hint).
Test fixtures are updated to include the new required field.
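A minimal sketch of the path-based inference described in this commit message, not the code in the diff. The function name `inferTranscriptSource`, both regexes, and the `YYYY-MM-DD` shape of the date directory are assumptions made for illustration; the real slug-segment regex lives in the codebase and may differ.

```typescript
import { basename, dirname } from "node:path";

// Assumed patterns (hypothetical): lowercase alphanumeric segments joined by
// single hyphens, and an ISO-style date directory name.
const SOURCE_SEGMENT = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;
const DATE_DIR = /^\d{4}-\d{2}-\d{2}$/;

// Returns the source label for <corpus>/<source>/<date>/<id>.md, or null when
// the path does not look like a <source>/<date>/ pair.
function inferTranscriptSource(filePath: string): string | null {
  const dateDir = basename(dirname(filePath));
  if (!DATE_DIR.test(dateDir)) return null;

  const source = basename(dirname(dirname(filePath)));
  return SOURCE_SEGMENT.test(source) ? source : null;
}
```

Returning null instead of a best guess means an ad-hoc `--input` path such as `/tmp/adhoc/abc.md` never produces a source label.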
Stop asking the Sonnet subagent to invent frontmatter and have the orchestrator stamp every deterministic field directly. The subagent retains authority over `title` and `tags` because those genuinely need its judgment. Everything else — date, effective_date, transcript_source, transcript_id, transcript_hash, chunk, dream_generated, dream_cycle_date — is computed from the per-child state the orchestrator already tracks for chunk-slug rewriting.
The drift this fixes was unambiguous. Three reflections from one cycle:
```
2026-05-21-verbatim-consistency-over-clever-recasts-558bf0.md no effective_date, no transcript_hash
2026-06-02-docs-discipline-as-release-surface-f1a913.md transcript_hash present, no effective_date
2026-06-02-rebase-permission-boundary-97ed24.md effective_date present, transcript_hash_suffix instead of transcript_hash
```
Three reasonable choices the subagent could make, three incompatible shapes downstream. Queries that key on `effective_date` or `transcript_hash` silently missed rows.
What changed:
- `chunkInfo` (Map<jobId, {idx, hash6}>) renamed to `childMeta` and populated for every child, not just chunked ones. Carries idx, hash6, chunkTotal, transcriptSource, transcriptId, inferredDate.
- `collectChildPutPageSlugs` now returns `{ slug, source_id, jobId }` so the dual-write can pair each put_page back to the child that produced it.
- New `buildDeterministicFrontmatter(meta, cycleDate)` returns the override map.
- `reverseWriteRefs` now:
  - looks up the meta per slug via job_id,
  - persists the merged frontmatter back to the DB row via `engine.putPage` (so search index, doctor `effective_date_health`, and any consumer that queries frontmatter see the same fields as disk), and
  - threads the overrides into `renderPageToMarkdown` for the disk write.
- `renderPageToMarkdown` gains an `extraOverrides` parameter; the existing identity stamp stays in place and the orchestrator's per-page values land after it.
- `stripContentVersionSuffix` collapses claude-code-archive's `<uuid>--<contentHash>.md` form to the bare session UUID so edited transcripts share a `transcript_id` instead of inflating to one id per edit. Source-of-truth content version stays addressable via `transcript_hash`.
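The suffix collapse in the last bullet could look roughly like the sketch below. It assumes the function takes a file name and that the session id itself never contains `--`; the name `stripContentVersionSuffixSketch` is hypothetical and the real signature may differ.

```typescript
// Hypothetical sketch: "<uuid>--<contentHash>.md" -> "<uuid>".
// File names without a "--" separator are returned with only ".md" removed.
function stripContentVersionSuffixSketch(fileName: string): string {
  const stem = fileName.replace(/\.md$/, "");
  const sep = stem.indexOf("--");
  return sep === -1 ? stem : stem.slice(0, sep);
}
```

Two edits of the same session then map to the same `transcript_id`, while the content version remains distinguishable through `transcript_hash`.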
`type` and `title` are still owned by the subagent's put_page because `serializePageToMarkdown` reads them off the Page meta record rather than from the override map. The subagent has been picking those correctly across cycles, so leave that contract alone.
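To illustrate the override idea, here is a rough sketch under stated assumptions, not the implementation in this PR. The `ChildMeta` shape follows the fields listed above, but the field types, the `chunk` format (`idx + 1` over `chunkTotal`), the `dream_generated: true` value, and the helper names are all hypothetical. `transcript_hash` is left out because the description does not say how it is derived.

```typescript
// Hypothetical shape based on the fields named in the commit message.
interface ChildMeta {
  idx: number; // assumed 0-based chunk index
  hash6: string;
  chunkTotal: number;
  transcriptSource: string | null;
  transcriptId: string;
  inferredDate: string; // assumed YYYY-MM-DD
}

type Frontmatter = Record<string, unknown>;

// Builds the override map from orchestrator state only; nothing here depends
// on what the subagent wrote.
function buildDeterministicFrontmatterSketch(meta: ChildMeta, cycleDate: string): Frontmatter {
  const out: Frontmatter = {
    date: meta.inferredDate,
    effective_date: meta.inferredDate,
    transcript_id: meta.transcriptId,
    dream_generated: true, // assumed value
    dream_cycle_date: cycleDate,
  };
  if (meta.transcriptSource !== null) out.transcript_source = meta.transcriptSource;
  if (meta.chunkTotal > 1) out.chunk = `${meta.idx + 1}/${meta.chunkTotal}`; // assumed format
  return out;
}

// Overrides are spread last, so they replace any same-named subagent field
// while title and tags pass through untouched.
function mergeFrontmatterSketch(fromSubagent: Frontmatter, overrides: Frontmatter): Frontmatter {
  return { ...fromSubagent, ...overrides };
}
```

Spreading the overrides last is what makes drifted keys harmless: whatever the subagent put in `effective_date` is replaced by the orchestrator's value.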
Closes #2285
Summary
The dream synthesize phase now carries transcript metadata end-to-end without subagent drift and without dropping the conversation's actual date when the file gets re-synced. Three coupled changes that all live in transcript discovery + the synthesize orchestrator:
- `DiscoveredTranscript` carries a `transcriptSource` label (`claude-code`, `voice-notes`, etc.) inferred from the grandparent dir during discovery, so downstream consumers can filter by ingestion pipeline.
- Date inference reads the `## Metadata` `| First message |` row from the transcript body when available, falling back to the filename-regex date and then to file mtime. Stable across `rsync` / Dropbox / Syncthing / B2 re-syncs that re-stamp mtime.
- The orchestrator stamps the deterministic frontmatter (`transcript_id`, `transcript_source`, `effective_date`, `date`, `chunk`, `dream_cycle_date`) onto every dual-written page after each subagent's `put_page` settles. Subagents still own `type` / `title` / `tags` / body; only the metadata fields are owned by the orchestrator.
Before this PR
- `DiscoveredTranscript.transcriptSource`: not exposed.
After this PR
- `DiscoveredTranscript.transcriptSource`: set from `basename(dirname(dirname(filePath)))` during discovery; available throughout the downstream pipeline.
- Date inference: content wins when the `## Metadata` first-message row is present; filename-regex applies next; mtime is the final fallback. Stable across re-syncs.
Diagnosis (per change)
transcriptSource field. Discovery has the information (the dir name immediately above `<date>/<file>.md` is the source label by convention) but `DiscoveredTranscript` doesn't expose it. Adding the field plus the `basename(dirname(dirname(filePath)))` extraction makes it available to every consumer that already destructures the discovered transcript object.
Content-based date inference. Claude-code transcripts include a `## Metadata` table with a `| First message | 2026-05-15T03:51:11.584Z |` row. That timestamp is stable across copies and re-syncs because every sync tool that re-stamps mtime preserves file content. Parsing the row when present gives a date that matches the operator's mental model ("when did this conversation happen?") rather than the file-on-disk's metadata fingerprint. Upstream's existing filename-regex date inference at `transcript-discovery.ts:~261` is preserved as the intermediate fallback so brains that name transcripts by date keep working without behavior change.
Orchestrator-owned frontmatter. Subagents previously owned full frontmatter, including the fields the orchestrator computes deterministically. Drift (typos, omissions, model-side reformatting) leaked into the persisted row and the disk markdown. Fixed by re-rendering each row to disk markdown via `reverseWriteRefs` AND merging the deterministic frontmatter into the DB row via `engine.putPage` after the subagent's `put_page` lands.
Reproduction
1. Create transcripts at `~/.gbrain/transcripts/claude-code/2026-06-12/<a>.md` and `~/.gbrain/transcripts/voice-notes/2026-06-12/<b>.md`. Run synthesize; pre-patch, the resulting pages have no `transcript_source` field.
2. Run `touch ~/.gbrain/transcripts/claude-code/2026-06-12/<a>.md` between syncs. Re-run synthesize; pre-patch, the `effective_date` is today's date rather than the original conversation date. Post-patch, the page reads the `## Metadata` `| First message |` row and keeps the conversation date.
Tests
- `test/cycle-synthesize.test.ts`: new cases for the transcriptSource field, the content-date parser, the orchestrator merge.
- `test/cycle/transcript-discovery.test.ts`: new cases for the date inference cascade (content wins, filename fallback, mtime fallback).
- `bun run typecheck` clean.
- `bun run verify` green.
- Resulting pages carry a `transcript_source` label and a date that matches the conversation rather than the file metadata.
Coordination with the synthesize hardening pack
The `feat(dream): synthesize backfill loop hardening` PR (separate branch `feat/dream-synthesize-hardening`) also ships the orchestrator-owned frontmatter commit. Per "duplicate commits across branches are fine, the second PR's duplicate becomes a no-op," whichever PR merges second sees that commit auto-resolve. Both PRs are defensible as single-topic: this PR is "transcript metadata pipeline," the other is "synthesize backfill loop hardening."
Why one PR for three changes
All three live in the transcript-discovery → synthesize dual-write path. All three reproduce together on any brain that ingests transcripts from multiple sources or whose file timestamps drift relative to conversation start. Splitting would create a three-PR chain with an interleaved-merge dependency on shipment order (the orchestrator merge needs `transcriptSource` to stamp it). Bundling them as "the transcript metadata pipeline" lands the same code with one review surface.
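For reviewers who want the date cascade from the diagnosis in one place, a minimal sketch follows. It is an illustration under assumptions, not the code in `transcript-discovery.ts`: the function name `inferDateSketch`, both regexes, and the use of the UTC calendar date for the mtime fallback are hypothetical, and the row match is not restricted to the `## Metadata` section here.

```typescript
// Assumed row shape: | First message | 2026-05-15T03:51:11.584Z |
const FIRST_MESSAGE_ROW = /^\|\s*First message\s*\|\s*(\d{4}-\d{2}-\d{2})T[^|]*\|/m;
// Assumed filename convention: a YYYY-MM-DD date somewhere in the name.
const FILENAME_DATE = /(\d{4}-\d{2}-\d{2})/;

// Cascade: transcript content first, then filename, then file mtime.
function inferDateSketch(body: string, fileName: string, mtime: Date): string {
  const fromContent = FIRST_MESSAGE_ROW.exec(body)?.[1];
  if (fromContent) return fromContent;

  const fromName = FILENAME_DATE.exec(fileName)?.[1];
  if (fromName) return fromName;

  // Last resort: the UTC calendar date of the file's mtime.
  return mtime.toISOString().slice(0, 10);
}
```

Because the first two steps read only the file's content and name, a `touch` or a re-sync that rewrites mtime changes the result only for transcripts that have neither a first-message row nor a dated filename.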