feat(dream): orchestrator-owned transcript metadata + transcriptSource discovery field #2286

Open
brettdavies wants to merge 2 commits into garrytan:master from brettdavies:feat/dream-transcript-metadata
@brettdavies

Closes #2285

Summary

The dream synthesize phase now carries transcript metadata end-to-end without subagent drift and without dropping the conversation's actual date when the file gets re-synced. The PR makes three coupled changes, all of which live in transcript discovery and the synthesize orchestrator:

  • DiscoveredTranscript carries a transcriptSource label (claude-code, voice-notes, etc.) inferred from the grandparent dir during discovery, so downstream consumers can filter by ingestion pipeline.
  • Transcript date inference uses the ## Metadata | First message | row from the transcript body when available, falling back to the filename-regex date and then to file mtime. Stable across rsync / Dropbox / Syncthing / B2 re-syncs that re-stamp mtime.
  • The orchestrator stamps deterministic frontmatter (transcript_id, transcript_source, effective_date, date, chunk, dream_cycle_date) onto every dual-written page after each subagent's put_page settles. Subagents still own type / title / tags / body; only the metadata fields are owned by the orchestrator.

Before this PR

| Surface | Behavior |
| --- | --- |
| DiscoveredTranscript.transcriptSource | Does not exist; no clean way to filter by ingestion source |
| Transcript date | Inferred from file mtime; drifts when sync tools re-stamp it |
| Persisted page frontmatter | Subagent owns full frontmatter; subagent drift (typos, omissions, model reformatting) leaks into persisted pages |

After this PR

| Surface | Behavior |
| --- | --- |
| DiscoveredTranscript.transcriptSource | Populated from basename(dirname(dirname(filePath))) during discovery; available throughout the downstream pipeline |
| Transcript date | Content-metadata wins when the ## Metadata first-message-row is present; filename-regex applies next; mtime is the final fallback. Stable across re-syncs. |
| Persisted page frontmatter | Orchestrator stamps deterministic fields authoritatively, lockstep across DB and disk; subagent drift on those fields can't leak |

Diagnosis (per change)

transcriptSource field. Discovery has the information (the dir name immediately above <date>/<file>.md is the source label by convention) but DiscoveredTranscript doesn't expose it. Adding the field + the basename(dirname(dirname(filePath))) extraction makes it available to every consumer that already destructures the discovered transcript object.

Content-based date inference. Claude-code transcripts include a ## Metadata table with a | First message | 2026-05-15T03:51:11.584Z | row. That timestamp is stable across copies and re-syncs because every sync tool that re-stamps mtime preserves file content. Parsing the row when present gives a date that matches the operator's mental model ("when did this conversation happen?") rather than the file-on-disk's metadata fingerprint. Upstream's existing filename-regex date inference at transcript-discovery.ts:~261 is preserved as the intermediate fallback so brains that name transcripts by date keep working without behavior change.
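
The cascade described above (content row, then filename, then mtime) can be sketched roughly as follows. This is an illustration only, not the code in transcript-discovery.ts: the function name `inferTranscriptDate`, both regexes, and the choice to take the UTC calendar date from the timestamp are assumptions, and for brevity the sketch scans the whole body rather than only the ## Metadata section.

```typescript
// Hypothetical sketch of the date inference cascade; all names and regexes are assumptions.

// Matches a table row such as: | First message | 2026-05-15T03:51:11.584Z |
const FIRST_MESSAGE_ROW = /^\|\s*First message\s*\|\s*(\d{4}-\d{2}-\d{2})T[^|]*\|/m;

// Matches the first YYYY-MM-DD date embedded in a filename.
const FILENAME_DATE = /(\d{4}-\d{2}-\d{2})/;

export function inferTranscriptDate(body: string, fileName: string, mtime: Date): string {
  // 1. Content metadata wins: it survives any sync tool that re-stamps mtime.
  const fromContent = FIRST_MESSAGE_ROW.exec(body);
  if (fromContent) return fromContent[1];

  // 2. Intermediate fallback: a date in the filename.
  const fromName = FILENAME_DATE.exec(fileName);
  if (fromName) return fromName[1];

  // 3. Final fallback: file mtime, reduced to its UTC calendar date.
  return mtime.toISOString().slice(0, 10);
}

// Usage: a re-stamped mtime no longer changes the result when the row is present.
const exampleBody = [
  "## Metadata",
  "",
  "| Field | Value |",
  "| --- | --- |",
  "| First message | 2026-05-15T03:51:11.584Z |",
].join("\n");
const exampleDate = inferTranscriptDate(exampleBody, "abc.md", new Date());
```

Ordering the checks from most to least stable means a transcript only falls through to mtime when neither the body nor the filename carries a date.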

Orchestrator-owned frontmatter. Subagents previously owned full frontmatter, including the fields the orchestrator computes deterministically. Drift (typos, omissions, model-side reformatting) leaked into the persisted row and the disk markdown. The fix runs after the subagent's put_page lands: reverseWriteRefs re-renders each row to disk markdown and merges the deterministic frontmatter into the DB row via engine.putPage.
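
To make the ownership split concrete, here is a minimal sketch of the merge, assuming frontmatter is a plain key/value record. The `ChildMeta` shape, the function names, and the `chunk` string format are hypothetical and chosen for illustration; the PR's real `buildDeterministicFrontmatter` may differ.

```typescript
// Hypothetical sketch: orchestrator-owned fields overwrite whatever the subagent wrote.
type Frontmatter = Record<string, unknown>;

// Assumed per-child state; field names mirror the PR text, the types are guesses.
interface ChildMeta {
  transcriptId: string;
  transcriptSource: string | null;
  inferredDate: string; // YYYY-MM-DD
  idx: number;          // assumed zero-based chunk index
  chunkTotal: number;
}

function sketchDeterministicFrontmatter(meta: ChildMeta, cycleDate: string): Frontmatter {
  const overrides: Frontmatter = {
    transcript_id: meta.transcriptId,
    date: meta.inferredDate,
    effective_date: meta.inferredDate,
    dream_cycle_date: cycleDate,
  };
  // Only stamp a source when discovery actually found one.
  if (meta.transcriptSource !== null) overrides.transcript_source = meta.transcriptSource;
  // Assumed format for chunked transcripts, e.g. "2/3".
  if (meta.chunkTotal > 1) overrides.chunk = `${meta.idx + 1}/${meta.chunkTotal}`;
  return overrides;
}

function mergeFrontmatter(fromSubagent: Frontmatter, overrides: Frontmatter): Frontmatter {
  // Later spread wins: subagent keys such as title and tags survive,
  // while every deterministic key takes the orchestrator's value.
  return { ...fromSubagent, ...overrides };
}

// Usage: a drifted effective_date from the subagent is replaced.
const exampleOverrides = sketchDeterministicFrontmatter(
  { transcriptId: "abc", transcriptSource: "claude-code", inferredDate: "2026-05-15", idx: 0, chunkTotal: 1 },
  "2026-06-12",
);
const exampleMerged = mergeFrontmatter(
  { title: "T", tags: ["a"], effective_date: "2026-06-02" },
  exampleOverrides,
);
```

Applying the same override map to both the DB row and the disk render is what keeps the two in lockstep.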

Reproduction

  1. Ingest two transcripts under different source dirs: ~/.gbrain/transcripts/claude-code/2026-06-12/<a>.md and ~/.gbrain/transcripts/voice-notes/2026-06-12/<b>.md. Run synthesize; pre-patch the resulting pages have no transcript_source field.
  2. touch ~/.gbrain/transcripts/claude-code/2026-06-12/<a>.md between syncs. Re-run synthesize; pre-patch the effective_date is today's date rather than the original conversation date. Post-patch the page reads the ## Metadata | First message | row and keeps the conversation date.
  3. Inspect any dual-written page; pre-patch the deterministic fields are missing or drift from what the orchestrator computed; post-patch they're populated and matching.

Tests

  • test/cycle-synthesize.test.ts: new cases for the transcriptSource field, the content-date parser, the orchestrator merge.
  • test/cycle/transcript-discovery.test.ts: new cases for the date inference cascade (content wins, filename fallback, mtime fallback).
  • bun run typecheck clean. bun run verify green.
  • Verified end-to-end on a brain that ingests both claude-code and voice-note sources: every persisted page lands with the correct transcript_source label and a date that matches the conversation rather than the file metadata.

Coordination with the synthesize hardening pack

The feat(dream): synthesize backfill loop hardening PR (separate branch feat/dream-synthesize-hardening) also ships the orchestrator-owned frontmatter commit. Per "duplicate commits across branches are fine, the second PR's duplicate becomes a no-op," whichever PR merges second sees that commit auto-resolve. Both PRs are defensible as single-topic: this PR is "transcript metadata pipeline," the other is "synthesize backfill loop hardening."

Why one PR for three changes

All three live in the transcript-discovery → synthesize dual-write path. All three reproduce together on any brain that ingests transcripts from multiple sources or whose file timestamps drift relative to conversation start. Splitting would create a three-PR chain with an interleaved-merge dependency on shipment order (the orchestrator merge needs transcriptSource to stamp it). Bundling them as "the transcript metadata pipeline" lands the same code with one review surface.

Discovery now reports which archive the transcript came from by reading the directory two levels above the file (the parent of the date directory). The canonical claude-code-archive layout is `<corpus>/<source>/<date>/<id>.md`, so a file at `~/.gbrain/transcripts/claude-code/2026-05-19/abc.md` yields `transcriptSource: 'claude-code'`. Returns null when the path doesn't match a `<source>/<date>/` pair, which keeps ad-hoc `--input` runs from inventing a source name.

Source name validation is intentionally strict: lowercase alphanumeric with hyphen separators (matches the existing slug-segment regex). Anything that fails the check returns null rather than a half-baked guess.
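
A minimal sketch of the extraction and validation described above, assuming POSIX-style paths. The function name `inferTranscriptSource` and both regexes are assumptions; the commit only states that the check matches the existing slug-segment regex, which is not reproduced here.

```typescript
import { basename, dirname } from "node:path";

// Assumed patterns: lowercase alphanumeric segments joined by single hyphens,
// and a YYYY-MM-DD date directory. The real slug-segment regex may differ.
const SOURCE_SEGMENT = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;
const DATE_DIR = /^\d{4}-\d{2}-\d{2}$/;

export function inferTranscriptSource(filePath: string): string | null {
  const dateDir = basename(dirname(filePath));            // parent of the file
  const sourceDir = basename(dirname(dirname(filePath))); // grandparent of the file

  // Without a <source>/<date>/ pair, report no source rather than guessing.
  if (!DATE_DIR.test(dateDir)) return null;
  return SOURCE_SEGMENT.test(sourceDir) ? sourceDir : null;
}

// Usage with the layout from the commit message.
const exampleSource = inferTranscriptSource("/home/u/.gbrain/transcripts/claude-code/2026-05-19/abc.md");
```

Returning null for anything that fails either check keeps ad-hoc inputs from producing a plausible-looking but wrong label.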

Step one of the orchestrator-side-frontmatter work. The downstream synthesize phase will read this to write `transcript_source` into reflection / original frontmatter directly, instead of forcing the Sonnet subagent to invent a source label (which has been drifting: one reflection said `claude-code transcript <uuid>`, another just said `<uuid>` with no archive hint).

Test fixtures updated to include the new required field.

Stop asking the Sonnet subagent to invent frontmatter and have the orchestrator stamp every deterministic field directly. The subagent retains authority over `title` and `tags` because those genuinely need its judgment. Everything else — date, effective_date, transcript_source, transcript_id, transcript_hash, chunk, dream_generated, dream_cycle_date — is computed from the per-child state the orchestrator already tracks for chunk-slug rewriting.

The drift this fixes was unambiguous. Three reflections from one cycle:

```
2026-05-21-verbatim-consistency-over-clever-recasts-558bf0.md   no effective_date, no transcript_hash
2026-06-02-docs-discipline-as-release-surface-f1a913.md         transcript_hash present, no effective_date
2026-06-02-rebase-permission-boundary-97ed24.md                 effective_date present, transcript_hash_suffix instead of transcript_hash
```

Three reasonable choices the subagent could make, three incompatible shapes downstream. Queries that key on `effective_date` or `transcript_hash` silently missed rows.

What changed:

- `chunkInfo` (Map<jobId, {idx, hash6}>) renamed to `childMeta` and populated for every child, not just chunked ones. Carries idx, hash6, chunkTotal, transcriptSource, transcriptId, inferredDate.
- `collectChildPutPageSlugs` now returns `{ slug, source_id, jobId }` so the dual-write can pair each put_page back to the child that produced it.
- New `buildDeterministicFrontmatter(meta, cycleDate)` returns the override map.
- `reverseWriteRefs` now:
  - looks up the meta per slug via job_id,
  - persists the merged frontmatter back to the DB row via `engine.putPage` (so search index, doctor `effective_date_health`, and any consumer that queries frontmatter see the same fields as disk), and
  - threads the overrides into `renderPageToMarkdown` for the disk write.
- `renderPageToMarkdown` gains an `extraOverrides` parameter; the existing identity stamp stays in place and the orchestrator's per-page values land after it.
- `stripContentVersionSuffix` collapses claude-code-archive's `<uuid>--<contentHash>.md` form to the bare session UUID so edited transcripts share a `transcript_id` instead of inflating to one id per edit. Source-of-truth content version stays addressable via `transcript_hash`.
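
The suffix collapse in the last bullet could look something like the sketch below. It assumes the filename is `<uuid>.md` or `<uuid>--<contentHash>.md` and that a double hyphen never occurs inside the session UUID; the function is named `stripContentVersionSuffixSketch` to make clear it is not the PR's implementation.

```typescript
// Hypothetical sketch: reduce "<uuid>--<contentHash>.md" to the bare session UUID.
function stripContentVersionSuffixSketch(fileName: string): string {
  // Drop the markdown extension if present.
  const stem = fileName.replace(/\.md$/, "");
  // A standard UUID only uses single hyphens, so the first "--" marks the content hash.
  const sep = stem.indexOf("--");
  return sep === -1 ? stem : stem.slice(0, sep);
}

// Usage: two edits of the same session map to one transcript_id.
const firstEdit = stripContentVersionSuffixSketch("123e4567-e89b-12d3-a456-426614174000--abc123.md");
const secondEdit = stripContentVersionSuffixSketch("123e4567-e89b-12d3-a456-426614174000--def456.md");
```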

`type` and `title` are still owned by the subagent's put_page because `serializePageToMarkdown` reads them off the Page meta record rather than from the override map. The subagent has been picking those correctly across cycles, so leave that contract alone.