People appear on dozens of podcasts sharing ideas, recommendations, and opinions. That content is scattered across platforms, locked inside audio, and essentially unsearchable. If you want to know what Andrew Huberman thinks about psychedelics, you have to listen to 14 episodes across Lex Fridman, Joe Rogan, and Tim Ferriss (probably 40+ hours) and mentally merge what he said across all of them.
I wanted to build the opposite experience: a single person page that synthesizes everything someone has said across all their podcast appearances, ranked by conviction strength, with every claim traceable to a specific moment in the audio.
The system is a five-stage episode processing pipeline followed by a multi-step aggregation pipeline, all running as TypeScript CLI scripts.
Episode pipeline: Audio URL goes in, structured JSON comes out. Deepgram Nova-3 handles transcription with speaker diarization — it's the best accuracy-to-cost ratio I found, and native diarization means I don't need a separate speaker separation step. The transcript then runs through Claude for three passes: proper noun correction (speech-to-text mangles names like "Ramoni Cajal" → "Ramón y Cajal"), speaker identification (mapping "Speaker 0" to real names using episode metadata as context), and structured extraction.
Extraction uses a multi-pass approach: Pass 1 (Haiku) extracts entities — people, companies, books, tools — cheaply and with high coverage. Pass 2 (Sonnet) handles themes, quotes, opinions, and anecdotes, with the entity list from Pass 1 as context so it doesn't waste capacity re-discovering names. Both passes validate against Zod schemas at runtime with retry on failure. After extraction, a programmatic quote correction step verifies every quote against the original transcript utterances by timestamp, replacing any that drifted during Claude's extraction. A separate validation script checks entity cross-references and quote accuracy.
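The validate-and-retry loop around each extraction pass can be sketched roughly as follows. This is a minimal illustration, not the project's code: `generate` is a hypothetical stand-in for a Claude call, `parse` stands in for a Zod schema's `.parse` method (anything that throws on invalid input), and `parseEntityList` is a hand-written validator used only so the sketch has no dependencies.

```typescript
// Hypothetical sketch: `generate` stands in for a model call and `parse`
// for a Zod schema's `.parse`; neither name comes from the project.
type Generate = (attempt: number, lastError?: string) => Promise<string>;
type Parse<T> = (value: unknown) => T; // must throw on invalid input

async function extractWithRetry<T>(
  generate: Generate,
  parse: Parse<T>,
  maxAttempts = 3, // assumed default, not taken from the source
): Promise<T> {
  let lastError: string | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // The previous error is passed back so the prompt can mention it.
    const raw = await generate(attempt, lastError);
    try {
      // JSON.parse and the schema check can both throw; either triggers a retry.
      return parse(JSON.parse(raw));
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  throw new Error(`Extraction failed after ${maxAttempts} attempts: ${lastError}`);
}

// Minimal hand-written validator used here in place of a real Zod schema.
interface EntityList {
  entities: string[];
}

function parseEntityList(value: unknown): EntityList {
  const v = value as { entities?: unknown } | null;
  if (!v || !Array.isArray(v.entities) || !v.entities.every((e) => typeof e === "string")) {
    throw new Error("expected { entities: string[] }");
  }
  return { entities: v.entities as string[] };
}
```

Feeding the validation error back into the next attempt is one design option; a simpler variant would resend the same prompt unchanged.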
Person aggregation: Once multiple episodes exist for a person, five Claude calls build the profile, including the four described here. Semantic theme merging groups related themes across episodes ("MDMA for PTSD" and "MDMA therapy sessions" become one unified theme). Conviction extraction ranks every opinion by strength on a 1-10 scale based on language intensity, cross-episode repetition, and depth of argument. Worldview synthesis generates a narrative summary. Taste clustering groups recommendations by pattern: not "books" and "supplements" as categories, but "Sleep Architecture Stack" and "Working-Class Punk Ethos" as thematic clusters that reveal how someone thinks.
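Once the model has assigned strengths, ordering convictions for the person page is plain code. The sketch below assumes a hypothetical `Conviction` shape. The tie-break on number of episodes is an illustrative choice, not something the pipeline is known to do.

```typescript
// Hypothetical shape: field names are assumptions, not the project's schema.
interface Conviction {
  statement: string;
  strength: number; // 1-10, assigned by the model
  episodeIds: string[]; // episodes where the position appears
}

function rankConvictions(items: Conviction[]): Conviction[] {
  return items
    // Drop anything outside the 1-10 integer scale instead of trusting model output.
    .filter((c) => Number.isInteger(c.strength) && c.strength >= 1 && c.strength <= 10)
    // Strongest first; ties go to the position repeated across more episodes (assumed rule).
    .sort((a, b) => b.strength - a.strength || b.episodeIds.length - a.episodeIds.length);
}
```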
Discovery: An RSS-based system registers podcast feeds via the iTunes API and scans them for guest appearances, automatically skipping podcasts where the person is the host.
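The guest-scan logic might look something like this once feeds are parsed into plain objects. The `Feed` and `FeedEpisode` shapes are hypothetical, and detecting the host by matching the feed's author field is an assumption about how host-skipping could work. Feed registration via the iTunes API is not shown.

```typescript
// Hypothetical shapes: these fields are assumptions, not the project's schema.
interface FeedEpisode {
  title: string;
  description: string;
}

interface Feed {
  podcastTitle: string;
  author: string; // assumed to carry the host's name
  episodes: FeedEpisode[];
}

interface Appearance {
  podcast: string;
  title: string;
}

const normalize = (s: string): string => s.toLowerCase().replace(/\s+/g, " ").trim();

function findGuestAppearances(feeds: Feed[], person: string): Appearance[] {
  const name = normalize(person);
  const hits: Appearance[] = [];
  for (const feed of feeds) {
    // Skip feeds the person hosts: every episode would match and none is a guest spot.
    if (normalize(feed.author).includes(name)) continue;
    for (const ep of feed.episodes) {
      if (normalize(`${ep.title} ${ep.description}`).includes(name)) {
        hits.push({ podcast: feed.podcastTitle, title: ep.title });
      }
    }
  }
  return hits;
}
```

Plain substring matching will miss nicknames and misspellings, so a real scan would likely need fuzzier matching.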
Why Deepgram over Whisper? Whisper is free but I'd need to run inference infrastructure and handle diarization separately. Deepgram's Nova-3 at $0.26/hr gives me production-quality diarization out of the box. For a 3-hour JRE episode, that's ~$0.78 — worth it to avoid the ops overhead.
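The arithmetic behind that figure is simple enough to keep as a helper. This sketch uses the $0.26/hr rate quoted above. The function name and the rounding to whole cents are illustrative assumptions.

```typescript
const DEEPGRAM_USD_PER_HOUR = 0.26; // rate quoted in the text

function transcriptionCostUsd(
  durationSeconds: number,
  usdPerHour: number = DEEPGRAM_USD_PER_HOUR,
): number {
  const hours = durationSeconds / 3600;
  // Round to whole cents so floating-point noise does not leak into reports.
  return Math.round(hours * usdPerHour * 100) / 100;
}
```

For a 3-hour episode, `transcriptionCostUsd(3 * 3600)` gives 0.78, matching the estimate above.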
Why Claude for extraction instead of fine-tuned models? The extraction task requires nuanced judgment — distinguishing a speaker's own opinion from a quote they're relaying, deciding if a book mention is a genuine recommendation or a casual reference, identifying which themes are deep dives vs passing mentions. A fine-tuned model would need thousands of labeled examples I don't have. Claude handles it with prompt engineering and Zod validation, and I can iterate on the prompt without retraining.
Why theme-based extraction over flat arrays? Early prototypes extracted flat lists of quotes, books, and opinions. The problem was aggregation: merging two episodes produced 20 disconnected quotes with no structure. With themes as the organizing unit, each quote, opinion, and anecdote is contextualized within a topic, which is what makes cross-episode merging work.
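To make the difference concrete, here is a rough sketch of a theme-first shape and a merge across episodes. All type and field names are hypothetical. The merge keys on a case-insensitive theme name purely as a stand-in: the real pipeline groups themes semantically with a Claude call, which this code does not attempt.

```typescript
// Hypothetical shapes: names are assumptions, not the project's schema.
interface Quote {
  text: string;
  startSeconds: number;
}

interface Theme {
  name: string;
  quotes: Quote[];
  opinions: string[];
}

interface EpisodeExtraction {
  episodeId: string;
  themes: Theme[];
}

interface MergedTheme {
  name: string;
  quotes: (Quote & { episodeId: string })[];
  opinions: string[];
}

function mergeThemes(episodes: EpisodeExtraction[]): MergedTheme[] {
  const byKey = new Map<string, MergedTheme>();
  for (const ep of episodes) {
    for (const theme of ep.themes) {
      // Stand-in for semantic grouping: exact name match, ignoring case.
      const key = theme.name.trim().toLowerCase();
      let merged = byKey.get(key);
      if (!merged) {
        merged = { name: theme.name, quotes: [], opinions: [] };
        byKey.set(key, merged);
      }
      // Each quote keeps its episode id so it stays traceable after merging.
      for (const q of theme.quotes) merged.quotes.push({ ...q, episodeId: ep.episodeId });
      merged.opinions.push(...theme.opinions);
    }
  }
  return Array.from(byKey.values());
}
```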
Why not a database yet? The pipeline reads and writes JSON files. This is deliberate for the prototype phase — I can inspect every intermediate artifact, re-run any stage independently, and iterate without migrations. The roadmap has PostgreSQL + Prisma for Phase 1, but premature persistence would have slowed iteration when the schema was still changing weekly.
Speaker attribution accuracy was harder than expected. Deepgram's diarization is good but not perfect: it occasionally merges two speakers or splits one speaker into two tracks. The Claude-based speaker identification step uses episode metadata (YouTube titles almost always name the guest) as the primary signal, with a confidence threshold that labels uncertain identifications as "Unknown Speaker" rather than guessing wrong, since a wrong attribution corrupts the entire dataset.
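The thresholding step can be sketched as below. The `SpeakerGuess` shape, the 0-to-1 confidence scale, and the 0.8 default are all assumptions made for illustration. The source only states that a threshold exists.

```typescript
// Hypothetical shape for the model's per-label guess.
interface SpeakerGuess {
  label: string; // diarization label, e.g. "Speaker 0"
  name: string; // name proposed by the identification step
  confidence: number; // assumed to be in [0, 1]
}

const UNKNOWN_SPEAKER = "Unknown Speaker";

function resolveSpeakers(guesses: SpeakerGuess[], threshold = 0.8): Map<string, string> {
  const resolved = new Map<string, string>();
  for (const g of guesses) {
    // Below the threshold, prefer an explicit unknown over a possibly wrong name.
    resolved.set(g.label, g.confidence >= threshold ? g.name : UNKNOWN_SPEAKER);
  }
  return resolved;
}
```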
Quote quality required multiple iterations. Early extractions pulled garbled quotes with false starts, repeated words, and speech-to-text artifacts. The prompt now explicitly filters these. A subtler issue: speakers often relay other people's words ("my friend said, alcohol is borrowing happiness from tomorrow"). The extraction prompt now distinguishes own words from relayed quotes — only the speaker's original statements make it into the profile. Even with prompt-level rules, Claude sometimes paraphrases or merges utterances. The post-extraction quote correction script catches these by matching timestamps back to the raw transcript — no API calls, just programmatic comparison.
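A rough sketch of that timestamp-based check follows. The `Utterance` and `ExtractedQuote` shapes, the half-second tolerance, and the substring comparison are assumptions: the source says only that quotes are matched back to raw utterances by timestamp and replaced when they have drifted.

```typescript
// Hypothetical shapes: field names are assumptions, not the project's schema.
interface Utterance {
  start: number; // seconds
  end: number;
  text: string;
}

interface ExtractedQuote {
  text: string;
  start: number;
  end: number;
}

interface CheckedQuote extends ExtractedQuote {
  corrected: boolean;
}

// ASCII-only normalization: lowercases, drops punctuation, collapses whitespace.
const normalizeText = (s: string): string =>
  s.toLowerCase().replace(/[^a-z0-9\s]/g, "").replace(/\s+/g, " ").trim();

function correctQuote(
  quote: ExtractedQuote,
  utterances: Utterance[],
  toleranceSec = 0.5, // assumed slack around the quoted time window
): CheckedQuote {
  const overlapping = utterances.filter(
    (u) => u.end > quote.start - toleranceSec && u.start < quote.end + toleranceSec,
  );
  // Nothing to compare against: leave the quote as-is and let validation flag it.
  if (overlapping.length === 0) return { ...quote, corrected: false };

  const verbatim = overlapping.map((u) => u.text).join(" ");
  // A quote that appears verbatim inside the transcript span has not drifted.
  if (normalizeText(verbatim).includes(normalizeText(quote.text))) {
    return { ...quote, corrected: false };
  }
  // Otherwise the model paraphrased or merged text: fall back to the transcript.
  return { ...quote, text: verbatim, corrected: true };
}
```

Replacing with the full overlapping span is the bluntest possible repair. A finer-grained version could align at the word level and keep only the closest matching stretch.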
Cost management matters when each episode costs $0.80-2.40 to process. I built per-step cost tracking that persists to per-episode costs.json ledger files, with a costs command that aggregates spending across all episodes and profiles. The multi-pass extraction uses Haiku for entity extraction (~10x cheaper than Sonnet) and reserves Sonnet for the harder theme/opinion work. Host theme extraction is skipped (~30-40% output token savings). Aggregation uses Opus for conviction extraction and worldview synthesis where nuance matters, Sonnet for the rest. The full aggregation pipeline runs ~$0.12 per person. A Gemini extraction script exists for A/B testing extraction quality across providers.
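The aggregation behind a costs command could be as small as the sketch below. It works on ledgers already loaded into memory. The `CostEntry` shape and the per-step rollup are assumptions about what a costs.json file might contain, and file reading is left out.

```typescript
// Hypothetical ledger entry: field names are assumptions, not the project's schema.
interface CostEntry {
  step: string; // e.g. "transcription", "entities", "themes"
  model: string;
  usd: number;
}

// One array of entries per episode id, as if loaded from per-episode ledger files.
type Ledger = Record<string, CostEntry[]>;

interface CostSummary {
  total: number;
  byStep: Record<string, number>;
}

const roundCents = (n: number): number => Math.round(n * 100) / 100;

function summarizeCosts(ledger: Ledger): CostSummary {
  const byStep: Record<string, number> = {};
  let total = 0;
  for (const episodeId of Object.keys(ledger)) {
    for (const entry of ledger[episodeId]) {
      byStep[entry.step] = (byStep[entry.step] ?? 0) + entry.usd;
      total += entry.usd;
    }
  }
  // Round only at the end so small per-call amounts are not lost along the way.
  for (const step of Object.keys(byStep)) byStep[step] = roundCents(byStep[step]);
  return { total: roundCents(total), byStep };
}
```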
The prototype processes real episodes end-to-end and generates static HTML person pages with collapsible sections, timestamp-linked quotes, and conviction-ranked positions. Andrew Huberman's profile across 12 podcast appearances is the primary test case — convictions ranked by strength, taste clusters, a worldview narrative, full podcast attribution with timestamp links, all from first-party transcript data only. The multi-pass extraction pipeline, post-extraction validation, and per-step cost tracking are all operational.
Next: re-extract existing episodes with the multi-pass pipeline, profile a second person to validate generalization, then port to Next.js with persistent storage and build the connection scoring engine for inline contextual discovery cards.