Source of truth: podgraph-roadmap-revised.md
- Transcription — Deepgram nova-3, speaker diarization, 600s timeout for long episodes
- Transcript correction — Claude-powered proper noun correction with global cache
- Speaker identification — Claude maps speaker labels to real names
- AI extraction — Structured theme-based extraction with Zod validation
- Entity registry — `update-registry.ts` merges entities into `data/entities.json`
- Lex Fridman fast path — Scrape pre-made transcripts from lexfridman.com
- Manifest tracking and cost estimation
- Duplicate episode detection — pipeline checks manifest before processing, `--force` to override
- Direct audio URL support — downloads locally before Deepgram (fixes CDN blocking)
- Progress indicators — elapsed time and expected wait on all pipeline steps
- Relayed quote filtering — extraction prompt rejects quotes the speaker is repeating from others
- Static HTML page generator (`build-page.ts`) — retired, not carried forward
- `add-podcast.ts` — register podcast RSS feeds via iTunes Search API
- `discover.ts` — scan RSS feeds for guest appearances, skip host's own podcast
- `data/podcasts.json` — podcast registry with host info for exclusion logic
- Cross-episode data merging — themes, books, tools, people, companies deduplicated
- Semantic theme merging — Claude groups related themes across episodes (~$0.015)
- Conviction extraction & ranking — strength scored 1-10, evolution detection (~$0.053)
- Worldview synthesis — 2-3 paragraph narrative, not a bio (~$0.013)
- Deep-on badge identification — 2-4 signature deep-dive topics (~$0.011)
- Taste clustering — recommendations grouped by thematic pattern (~$0.027)
- Role deduplication — synonym groups + substring merging
- `extract-multipass.ts` — Pass 1 (Haiku) entities, Pass 2 (Sonnet) themes with entity context. Now the default pipeline extraction.
- `correct-quotes.ts` — post-extraction quote correction against transcript utterances (no API calls)
- `validate-extraction.ts` — checks quote accuracy, entity cross-refs, entity classification
- `extract-gemini.ts` — Gemini A/B testing for extraction quality comparison
- `export-prompt.ts` — export resolved prompts for manual model testing in claude.ai
- `status.ts` — per-person extraction/aggregation status
- `costs.ts` — cost summary across all episodes and profiles
- `scripts/lib/costs.ts` — per-step cost ledger (model, tokens, USD per pipeline step)
- Static HTML output with section nav index
- Merged Positions & Beliefs section — convictions + theme context unified, no duplication
- Topics Discussed — themes without opinions, separate section
- Collapsible sections — `<details>`/`<summary>`, top 7 positions expanded, rest collapsed
- Taste clusters — expandable cluster cards
- People mentioned — compact chip grid with expandable contexts
- Podcast appearances — clickable links to source audio
- Timestamp links — quotes link to exact moment in YouTube/audio
- Full episode attribution — podcast name + episode title on every quote
- No JavaScript — all CSS-only interactions
- `PODGRAPH_PIPELINE_GUIDE.md` — full usage guide for all scripts, prompts, data files
- Re-extract all 12 Huberman episodes using 4-pass extraction (segmentation + entities parallel, theme synthesis from summaries, quote selection on targeted segments)
- Run `npm run correct-quotes` after re-extraction to verify quote accuracy
- Run `npm run validate` on each episode to check entity references
- Re-aggregate Huberman profile with 4-pass extractions: 12 episodes, 81 themes (31 cross-episode), 31 convictions, 130 tools
- Review the updated profile page for quality
- Update `CASE_STUDY.md` — add multi-pass extraction, quote correction, validation, model optimization
- Update `PODGRAPH_PIPELINE_GUIDE.md` — multi-pass is now default, add new scripts (correct-quotes, validate, status, extract-multipass, extract-gemini)
- Update `TODO.md` Phase 0.5 to reflect latest changes (multi-pass, quote correction, validation, cost tracking, Gemini A/B test)
- Profile a second person to test the pipeline beyond Huberman
- Add more podcasts to the registry (Modern Wisdom, Rich Roll, All-In, etc.)
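
The "Role deduplication" item above names two steps: synonym groups, then substring merging. A minimal sketch of how those two steps could fit together is shown below. The synonym table, the function names, and the rule of keeping the longer (more specific) role when one role contains another are all assumptions made for illustration, not the project's actual implementation.

```typescript
// Hypothetical synonym groups: every variant maps to one canonical role.
const SYNONYM_GROUPS: Record<string, string[]> = {
  neuroscientist: ["neuroscientist", "neurobiologist", "brain scientist"],
  podcaster: ["podcaster", "podcast host"],
};

// Lookup from each variant to its canonical role.
const CANONICAL = new Map<string, string>();
for (const canonical of Object.keys(SYNONYM_GROUPS)) {
  for (const variant of SYNONYM_GROUPS[canonical]) {
    CANONICAL.set(variant, canonical);
  }
}

// True when `needle` appears inside `haystack` as a run of whole words.
function containsPhrase(haystack: string, needle: string): boolean {
  return ` ${haystack} `.includes(` ${needle} `);
}

function dedupeRoles(roles: string[]): string[] {
  // Step 1: normalize case/whitespace and collapse synonyms.
  const normalized = new Set<string>();
  for (const role of roles) {
    const key = role.trim().toLowerCase();
    if (key) normalized.add(CANONICAL.get(key) ?? key);
  }
  // Step 2: substring merging. Assumption: drop a role when a longer
  // role already contains it, so the more specific wording survives.
  const all = Array.from(normalized);
  return all.filter(
    (role) => !all.some((other) => other !== role && containsPhrase(other, role))
  );
}
```

With this sketch, `dedupeRoles(["Professor", "Stanford professor", "Neurobiologist", "neuroscientist", "Podcast host"])` keeps three roles: `stanford professor`, `neuroscientist`, and `podcaster`. Matching on whole words avoids merging unrelated roles that merely share letters.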
Goal: Move from scripts to a real application with persistent storage.
- Next.js 14 app scaffolding — TypeScript, Tailwind, shadcn/ui
- PostgreSQL setup — Supabase or Neon
- Prisma schema — Podcast, Episode (internal, no public route), Person, PersonConnection, EntityRegistry
- BullMQ + Redis setup for background job queue
- tRPC setup for type-safe API routes
- Migrate `scripts/transcribe.ts` → `src/lib/pipeline/transcription.ts`
- Migrate `scripts/correct-transcript.ts` → `src/lib/ai/correction.ts`
- Migrate `scripts/identify-speakers.ts` → `src/lib/pipeline/speaker-id.ts`
- Migrate `scripts/extract.ts` → `src/lib/ai/extraction.ts`
- Migrate `scripts/update-registry.ts` → `src/lib/pipeline/registry.ts`
- Create BullMQ workers: `transcription.worker.ts`, `extraction.worker.ts`, `aggregation.worker.ts`
- Admin form to submit episode URL → triggers pipeline
- Pipeline status monitoring in admin
- Basic person list page — name, roles, appearance count
- Basic full-text search over Person names and themes (P1)
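
The items above describe an admin form that submits an episode URL and three workers that process it in turn. One way to chain those stages is sketched below in plain TypeScript, with no BullMQ calls, so the hand-off logic can be read on its own. The stage names mirror the three planned worker files; the `PipelineJob` shape and both function names are hypothetical.

```typescript
// Stage names mirror the three planned workers; the payload fields are assumptions.
type Stage = "transcription" | "extraction" | "aggregation";

interface PipelineJob {
  stage: Stage;
  episodeUrl: string;
  force: boolean; // mirrors the existing --force override
}

const STAGE_ORDER: Stage[] = ["transcription", "extraction", "aggregation"];

// Called when the admin form submits an episode URL:
// validates the input and returns the first job to enqueue.
function createInitialJob(episodeUrl: string, force = false): PipelineJob {
  if (!/^https?:\/\/\S+$/.test(episodeUrl)) {
    throw new Error(`Not an http(s) URL: ${episodeUrl}`);
  }
  return { stage: "transcription", episodeUrl, force };
}

// Called by a worker after its stage succeeds:
// returns the follow-up job, or null when the pipeline is finished.
function nextJob(done: PipelineJob): PipelineJob | null {
  const next = STAGE_ORDER[STAGE_ORDER.indexOf(done.stage) + 1];
  return next ? { ...done, stage: next } : null;
}
```

In a real worker, the object returned by `nextJob` would be added to the queue for the next stage. Keeping the ordering in one array means a later stage (for example, transcript correction) can be inserted without touching each worker.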
Goal: Build the person page as a React/Next.js page (currently static HTML prototype).
- Connection card generation — also-spoke-about, disagrees-on, recommended-by
- Process more episodes per person to stress-test aggregation at scale
- Port static HTML profile to Next.js React components
- Header — name, aggregated self-described roles, appearance count, date range
- Worldview summary section
- Positions & beliefs — merged convictions + theme context
- Inline contextual connection cards
- Taste profile — clustered recommendations
- People they mention — compact grid with expandable contexts
- Podcast appearances — chronological with theme highlights
Goal: Add browsing, category exploration, and frontend polish.
- Category browsing pages — by occupation, interest, hobby, recommended books (P0)
- Home / Explore page — featured people, trending topics, recent episodes (P0)
- Podcast page — podcast info, all processed episodes, guest profile links (P1)
- Responsive design — full mobile responsiveness across all pages (P1)
- Search enhancements — search within quotes, filter by topic, date range, person (P1)
Goal: Automate ingestion and prepare for growth.
- RSS feed auto-ingestion — scheduled jobs to check RSS feeds and process new episodes (P0)
- Admin dashboard — manage podcasts, monitor pipeline status, review flagged IDs (P0)
- Auth + user accounts — registration, saved favorites, custom collections (P1)
- Performance optimization — caching, ISR for profile pages, lazy loading, pagination (P1)
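
RSS auto-ingestion needs the same guard the current scripts apply through the manifest: skip episodes that were already processed unless a force flag is set. A minimal sketch of that selection step follows. The `FeedItem` shape, the use of the feed GUID as the manifest key, and the function name are assumptions; fetching and parsing the feed is left out.

```typescript
// Hypothetical shape of one parsed RSS entry.
interface FeedItem {
  guid: string;
  title: string;
  audioUrl: string;
}

// Returns the feed items that still need processing.
// `processedGuids` stands in for the manifest lookup; `force` re-queues everything.
function selectNewEpisodes(
  items: FeedItem[],
  processedGuids: ReadonlySet<string>,
  force = false
): FeedItem[] {
  const seen = new Set<string>();
  return items.filter((item) => {
    // Guard against the same entry appearing twice in one feed.
    if (seen.has(item.guid)) return false;
    seen.add(item.guid);
    return force || !processedGuids.has(item.guid);
  });
}
```

A scheduled job would call `selectNewEpisodes` once per registered podcast and enqueue one pipeline job per returned item.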