PodGraph — TODO

Source of truth: podgraph-roadmap-revised.md


Phase 0: Pipeline Scripts (COMPLETE)

  • Transcription — Deepgram nova-3, speaker diarization, 600s timeout for long episodes
  • Transcript correction — Claude-powered proper noun correction with global cache
  • Speaker identification — Claude maps speaker labels to real names
  • AI extraction — Structured theme-based extraction with Zod validation
  • Entity registry — update-registry.ts merges entities into data/entities.json
  • Lex Fridman fast path — Scrape pre-made transcripts from lexfridman.com
  • Manifest tracking and cost estimation
  • Duplicate episode detection — pipeline checks manifest before processing, --force to override
  • Direct audio URL support — downloads locally before Deepgram (fixes CDN blocking)
  • Progress indicators — elapsed time and expected wait on all pipeline steps
  • Relayed quote filtering — extraction prompt rejects quotes the speaker is repeating from others
  • Static HTML page generator (build-page.ts) — retired, not carried forward
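
The duplicate-detection step above could look roughly like the sketch below. The manifest shape (`ManifestEntry`), the URL normalization, and the function names are assumptions for illustration; the real manifest format is not described in this TODO.

```typescript
// Hypothetical manifest entry; the actual manifest fields may differ.
interface ManifestEntry {
  episodeUrl: string;
  processedAt: string; // ISO timestamp
}

// Treat URLs that differ only by fragment or trailing slash as the same episode.
function normalizeUrl(url: string): string {
  return url.split("#")[0].replace(/\/+$/, "");
}

// Returns true when the pipeline should run for this episode.
// `force` mirrors the --force flag: it bypasses the manifest check.
function shouldProcess(
  manifest: ManifestEntry[],
  episodeUrl: string,
  force: boolean
): boolean {
  if (force) return true;
  const target = normalizeUrl(episodeUrl);
  return !manifest.some((entry) => normalizeUrl(entry.episodeUrl) === target);
}
```

With this sketch, an episode already in the manifest is skipped unless `force` is true, and an unseen URL is always processed.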

Phase 0.5: Discovery & Aggregation (COMPLETE)

Podcast Discovery

  • add-podcast.ts — register podcast RSS feeds via iTunes Search API
  • discover.ts — scan RSS feeds for guest appearances, skip host's own podcast
  • data/podcasts.json — podcast registry with host info for exclusion logic
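
As a rough illustration of the host-exclusion logic, the sketch below filters feed items for mentions of a person while skipping any podcast that the registry lists them as hosting. The `PodcastRecord` and `FeedItem` shapes and the simple name-matching rule are assumptions, not the actual `discover.ts` behavior.

```typescript
// Hypothetical shapes; data/podcasts.json may store host info differently.
interface PodcastRecord {
  title: string;
  hosts: string[];
}

interface FeedItem {
  podcastTitle: string;
  episodeTitle: string;
  description: string;
}

// Returns feed items that mention `person`, excluding podcasts they host.
function findGuestAppearances(
  person: string,
  items: FeedItem[],
  registry: PodcastRecord[]
): FeedItem[] {
  const needle = person.toLowerCase();
  const hostedTitles = new Set(
    registry
      .filter((p) => p.hosts.some((h) => h.toLowerCase() === needle))
      .map((p) => p.title)
  );
  return items.filter((item) => {
    if (hostedTitles.has(item.podcastTitle)) return false; // host's own show
    const haystack = (item.episodeTitle + " " + item.description).toLowerCase();
    return haystack.indexOf(needle) !== -1;
  });
}
```

Plain substring matching on names will produce false positives for common names, so a real implementation would likely need a stricter check.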

Person Aggregation Pipeline (aggregate.ts)

  • Cross-episode data merging — themes, books, tools, people, companies deduplicated
  • Semantic theme merging — Claude groups related themes across episodes (~$0.015)
  • Conviction extraction & ranking — strength scored 1-10, evolution detection (~$0.053)
  • Worldview synthesis — 2-3 paragraph narrative, not a bio (~$0.013)
  • Deep-on badge identification — 2-4 signature deep-dive topics (~$0.011)
  • Taste clustering — recommendations grouped by thematic pattern (~$0.027)
  • Role deduplication — synonym groups + substring merging
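
Role deduplication via synonym groups plus substring merging might be sketched as follows. The synonym groups shown are made-up examples, and keeping the longer (more specific) role when one contains another is an assumption; `aggregate.ts` may resolve this differently.

```typescript
// Hypothetical synonym groups; the first entry in each group is the canonical label.
const SYNONYM_GROUPS: string[][] = [
  ["neuroscientist", "neurobiologist"],
  ["podcaster", "podcast host"],
];

// Lowercase, trim, and map a role onto its canonical synonym if one exists.
function canonicalizeRole(role: string): string {
  const normalized = role.trim().toLowerCase();
  for (const group of SYNONYM_GROUPS) {
    if (group.indexOf(normalized) !== -1) return group[0];
  }
  return normalized;
}

// 1) Collapse synonyms, 2) drop any role that is a substring of another kept role.
function dedupeRoles(roles: string[]): string[] {
  const unique = Array.from(new Set(roles.map(canonicalizeRole)));
  return unique.filter(
    (role) => !unique.some((other) => other !== role && other.indexOf(role) !== -1)
  );
}
```

For the input `["Neuroscientist", "neurobiologist", "professor", "Stanford professor", "podcast host"]` this sketch yields `["neuroscientist", "stanford professor", "podcaster"]`. Raw substring checks can over-merge (for example "author" inside "coauthor"), so word-boundary matching may be a safer variant.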

Multi-Pass Extraction & Quality

  • extract-multipass.ts — Pass 1 (Haiku) entities, Pass 2 (Sonnet) themes with entity context. Now the default pipeline extraction.
  • correct-quotes.ts — post-extraction quote correction against transcript utterances (no API calls)
  • validate-extraction.ts — checks quote accuracy, entity cross-refs, entity classification
  • extract-gemini.ts — Gemini A/B testing for extraction quality comparison
  • export-prompt.ts — export resolved prompts for manual model testing in claude.ai
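
One way to check an extracted quote against transcript utterances without any API calls is plain token overlap, sketched below. The `Utterance` shape, the scoring rule, and the 0.8 threshold are assumptions; `correct-quotes.ts` may use a different matching strategy.

```typescript
// Hypothetical utterance shape, loosely modeled on diarized transcript output.
interface Utterance {
  speaker: string;
  start: number; // seconds from episode start
  text: string;
}

// Lowercase and split into alphanumeric tokens, dropping punctuation.
function tokenize(s: string): string[] {
  return s
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter((t) => t.length > 0);
}

// Fraction of quote tokens that also appear in the utterance (0 to 1).
function overlapScore(quote: string, utteranceText: string): number {
  const quoteTokens = tokenize(quote);
  if (quoteTokens.length === 0) return 0;
  const utteranceTokens = new Set(tokenize(utteranceText));
  let hits = 0;
  for (const t of quoteTokens) {
    if (utteranceTokens.has(t)) hits++;
  }
  return hits / quoteTokens.length;
}

// Returns the best-matching utterance, or null if nothing clears the threshold.
function findSourceUtterance(
  quote: string,
  utterances: Utterance[],
  threshold: number = 0.8
): Utterance | null {
  let best: Utterance | null = null;
  let bestScore = 0;
  for (const u of utterances) {
    const score = overlapScore(quote, u.text);
    if (score > bestScore) {
      best = u;
      bestScore = score;
    }
  }
  return bestScore >= threshold ? best : null;
}
```

A matched utterance gives both the verbatim wording and a start time; a `null` result flags a quote that may have been paraphrased or invented during extraction. Token overlap ignores word order, so it is a coarse filter rather than an exact check.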

Monitoring & Cost Tracking

  • status.ts — per-person extraction/aggregation status
  • costs.ts — cost summary across all episodes and profiles
  • scripts/lib/costs.ts — per-step cost ledger (model, tokens, USD per pipeline step)
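
A per-step cost ledger of this kind can be summarized with a small reducer. The sketch below assumes a `CostEntry` shape with the fields named above (model, tokens, USD); the actual field names in `scripts/lib/costs.ts` are not shown here and may differ.

```typescript
// Hypothetical ledger entry; field names are illustrative.
interface CostEntry {
  step: string; // e.g. "theme-merging"
  model: string;
  inputTokens: number;
  outputTokens: number;
  usd: number;
}

interface CostSummary {
  byStep: Record<string, number>;
  totalUsd: number;
}

// Sum USD per pipeline step and overall.
function summarizeCosts(entries: CostEntry[]): CostSummary {
  const byStep: Record<string, number> = {};
  let totalUsd = 0;
  for (const entry of entries) {
    byStep[entry.step] = (byStep[entry.step] || 0) + entry.usd;
    totalUsd += entry.usd;
  }
  return { byStep, totalUsd };
}
```

As a sanity check, feeding in the five approximate aggregation costs listed under the Person Aggregation Pipeline gives a total of about $0.12 per profile.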

Person Profile Page (build-profile.ts)

  • Static HTML output with section nav index
  • Merged Positions & Beliefs section — convictions + theme context unified, no duplication
  • Topics Discussed — themes without opinions, separate section
  • Collapsible sections — <details>/<summary>, top 7 positions expanded, rest collapsed
  • Taste clusters — expandable cluster cards
  • People mentioned — compact chip grid with expandable contexts
  • Podcast appearances — clickable links to source audio
  • Timestamp links — quotes link to exact moment in YouTube/audio
  • Full episode attribution — podcast name + episode title on every quote
  • No JavaScript — all CSS-only interactions
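
The collapsible sections and timestamp links can be produced with plain string templates, since `<details open>` handles the expanded state without any JavaScript. The sketch below is illustrative only: the `Position` shape and function names are assumptions, and `build-profile.ts` may structure its output differently. The YouTube `t=` query parameter is used for the start offset.

```typescript
// Hypothetical position record for the Positions & Beliefs section.
interface Position {
  title: string;
  body: string;
}

// Escape text for safe inclusion in HTML element content and attributes.
function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render each position as <details>; the first `expandedCount` start open.
function renderPositions(positions: Position[], expandedCount: number = 7): string {
  return positions
    .map((p, i) => {
      const open = i < expandedCount ? " open" : "";
      return (
        "<details" + open + "><summary>" + escapeHtml(p.title) +
        "</summary><p>" + escapeHtml(p.body) + "</p></details>"
      );
    })
    .join("\n");
}

// Build a YouTube link that starts playback at the quote's timestamp.
function youtubeTimestampUrl(videoId: string, seconds: number): string {
  return (
    "https://www.youtube.com/watch?v=" + encodeURIComponent(videoId) +
    "&t=" + Math.floor(seconds) + "s"
  );
}
```

Keeping the expanded count as a parameter makes it easy to tune how much of the page is visible on first load.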

Documentation

  • PODGRAPH_PIPELINE_GUIDE.md — full usage guide for all scripts, prompts, data files

Immediate Next Steps

Re-extract with 4-pass pipeline

  • Re-extract all 12 Huberman episodes using 4-pass extraction (segmentation + entities parallel, theme synthesis from summaries, quote selection on targeted segments)
  • Run npm run correct-quotes after re-extraction to verify quote accuracy
  • Run npm run validate on each episode to check entity references
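
The 4-pass flow described above (segmentation and entities in parallel, then theme synthesis from summaries, then quote selection on targeted segments) could be orchestrated roughly as below. All type and function names are hypothetical, the model calls are injected as a `Passes` object so the sketch stays self-contained, and passing entities into theme synthesis is an assumption based on the earlier two-pass design.

```typescript
// Hypothetical data shapes for the four passes.
interface Segment {
  id: number;
  summary: string;
  text: string;
}

interface Theme {
  name: string;
  segmentIds: number[]; // segments this theme was synthesized from
}

// The actual model calls are injected; this sketch only shows the control flow.
interface Passes {
  segment(transcript: string): Promise<Segment[]>;
  extractEntities(transcript: string): Promise<string[]>;
  synthesizeThemes(summaries: string[], entities: string[]): Promise<Theme[]>;
  selectQuotes(theme: Theme, segments: Segment[]): Promise<string[]>;
}

async function runFourPass(transcript: string, passes: Passes) {
  // Passes 1 and 2 are independent of each other, so they run in parallel.
  const [segments, entities] = await Promise.all([
    passes.segment(transcript),
    passes.extractEntities(transcript),
  ]);

  // Pass 3 works from segment summaries rather than the full transcript.
  const themes = await passes.synthesizeThemes(
    segments.map((s) => s.summary),
    entities
  );

  // Pass 4 only sees the segments each theme points at.
  const quotes: Record<string, string[]> = {};
  for (const theme of themes) {
    const targeted = segments.filter((s) => theme.segmentIds.indexOf(s.id) !== -1);
    quotes[theme.name] = await passes.selectQuotes(theme, targeted);
  }

  return { entities, themes, quotes };
}
```

Injecting the passes also makes the orchestration testable with stubs, independent of any model provider.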

Re-aggregate and rebuild

  • Re-aggregate Huberman profile with 4-pass extractions: 12 episodes, 81 themes (31 cross-episode), 31 convictions, 130 tools
  • Review the updated profile page for quality

Update documentation

  • Update CASE_STUDY.md — add multi-pass extraction, quote correction, validation, model optimization
  • Update PODGRAPH_PIPELINE_GUIDE.md — multi-pass is now default, add new scripts (correct-quotes, validate, status, extract-multipass, extract-gemini)
  • Update TODO.md Phase 0.5 to reflect latest changes (multi-pass, quote correction, validation, cost tracking, Gemini A/B test)

Process more people

  • Profile a second person to test the pipeline beyond Huberman
  • Add more podcasts to the registry (Modern Wisdom, Rich Roll, All-In, etc.)

Phase 1: Foundation

Goal: Move from scripts to a real application with persistent storage.

Scaffolding (P0)

  • Next.js 14 app scaffolding — TypeScript, Tailwind, shadcn/ui
  • PostgreSQL setup — Supabase or Neon
  • Prisma schema — Podcast, Episode (internal, no public route), Person, PersonConnection, EntityRegistry
  • BullMQ + Redis setup for background job queue
  • tRPC setup for type-safe API routes

Pipeline Migration (P0)

  • Migrate scripts/transcribe.ts → src/lib/pipeline/transcription.ts
  • Migrate scripts/correct-transcript.ts → src/lib/ai/correction.ts
  • Migrate scripts/identify-speakers.ts → src/lib/pipeline/speaker-id.ts
  • Migrate scripts/extract.ts → src/lib/ai/extraction.ts
  • Migrate scripts/update-registry.ts → src/lib/pipeline/registry.ts
  • Create BullMQ workers: transcription.worker.ts, extraction.worker.ts, aggregation.worker.ts

Admin & Ingestion (P0)

  • Admin form to submit episode URL → triggers pipeline
  • Pipeline status monitoring in admin

Basic Frontend (P0/P1)

  • Basic person list page — name, roles, appearance count
  • Basic full-text search over Person names and themes (P1)

Phase 2: Person Page MVP

Goal: Build the person page as a React/Next.js page (currently static HTML prototype).

Remaining Aggregation Work

  • Connection card generation — also-spoke-about, disagrees-on, recommended-by
  • Process more episodes per person to stress-test aggregation at scale

Person Profile Page — React (P0)

  • Port static HTML profile to Next.js React components
  • Header — name, aggregated self-described roles, appearance count, date range
  • Worldview summary section
  • Positions & beliefs — merged convictions + theme context
  • Inline contextual connection cards
  • Taste profile — clustered recommendations
  • People they mention — compact grid with expandable contexts
  • Podcast appearances — chronological with theme highlights

Phase 3: Discovery & Polish

Goal: Add browsing, category exploration, and frontend polish.

  • Category browsing pages — by occupation, interest, hobby, recommended books (P0)
  • Home / Explore page — featured people, trending topics, recent episodes (P0)
  • Podcast page — podcast info, all processed episodes, guest profile links (P1)
  • Responsive design — full mobile responsiveness across all pages (P1)
  • Search enhancements — search within quotes, filter by topic, date range, person (P1)

Phase 4: Scale & Automation

Goal: Automate ingestion and prepare for growth.

  • RSS feed auto-ingestion — scheduled jobs to check RSS feeds and process new episodes (P0)
  • Admin dashboard — manage podcasts, monitor pipeline status, review flagged IDs (P0)
  • Auth + user accounts — registration, saved favorites, custom collections (P1)
  • Performance optimization — caching, ISR for profile pages, lazy loading, pagination (P1)