The Podcast Knowledge Graph
March 2026 — Revised: Person Page MVP Focus
People appear on dozens of podcasts, sharing ideas, recommendations, stories, and opinions. This content is scattered, unsearchable, and ephemeral. There is no way to get a holistic view of a person through their own words across all their podcast appearances.
PodGraph ingests, transcribes, and analyzes podcast interviews to build rich, AI-generated profiles of people. Every profile is constructed exclusively from the person's own words. The platform surfaces connections between people through shared ideas, mutual references, and common topics.
- Person-centric profiles: Everything a person has said across all transcribed podcast interviews, organized into a single page.
- First-party data only: All information comes directly from the person's own words. No Wikipedia, no third-party bios.
- Contextual connection discovery: Discover lesser-known guests through shared specific interests, recommendations, and opposing viewpoints — surfaced inline at the point of interest.
- Browsable categories: Explore people through occupation, interests, hobbies, recommended books, and more.
The person page is the single most important surface in PodGraph. It is a comprehensive research tool where aggregated opinions, detailed anecdotes, cross-episode synthesis, and conviction-ranked positions live. The product's value is measured by how well this page answers: who is this person, what do they believe, and how has their thinking evolved?
There are no episode pages. Episodes are internal data units — they are ingested, transcribed, and extracted from, but they do not have their own public-facing route. All extracted content is surfaced through person pages, where it gains meaning through aggregation and cross-referencing. Individual episode data (podcast name, title, date, theme highlights) appears in the Podcast Appearances section of each person's profile, providing enough context to identify and find the original episode.
Connections are surfaced contextually: inline on person pages, at the exact point where a reader's interest is highest. The connection scoring engine powers these contextual cards, but there is no standalone graph destination. The working assumption is that standalone network graphs produce low engagement and little real discovery, while contextual surfacing at the point of interest is more effective.
The extraction pipeline has been validated through prototype development. Five scripts run in sequence, each reading the previous script's output. Any script can be re-run on its own as long as the previous script's output is still available.
| Step | Script |
|---|---|
| 1. Transcription | transcribe.ts — Deepgram nova-3, speaker diarization |
| 2. Transcript Correction | correct-transcript.ts — Claude-powered proper noun correction |
| 3. Speaker Identification | identify-speakers.ts — Claude maps speaker labels to real names |
| 4. Extraction | extract.ts — Structured theme-based extraction with Zod validation |
| 5. Entity Registry | update-registry.ts — Merges entities into global registry |
| Layer | Technology | Rationale |
|---|---|---|
| Frontend | Next.js 14+ (App Router) | SSR/SSG for SEO, React Server Components |
| Styling | Tailwind CSS + shadcn/ui | Rapid development, consistent design system |
| Backend API | Next.js API Routes + tRPC | Type-safe, co-located with frontend |
| Database | PostgreSQL (Supabase/Neon) | Relational + JSONB + full-text search |
| Job Queue | BullMQ + Redis | Background job processing for pipelines |
| Transcription | Deepgram API (nova-3) | High accuracy, native diarization |
| AI Extraction | Anthropic Claude API (Sonnet) | Structured extraction, summarization, aggregation |
| File Storage | S3-compatible (AWS/R2) | Audio files, transcript archives |
| Deployment | Vercel + Railway/Fly.io | Serverless frontend, persistent workers |
The person page is the product's primary surface. It must feel like a coherent narrative about a person, not a database dump organized into tabs. Every section earns its place by answering: does this help someone understand who this person is and what they believe?
Name, aggregated self-described roles, total appearances, date range of appearances. Roles come only from how the person introduces themselves across episodes — never external sources.
A 2–3 paragraph narrative synthesizing what this person cares about and believes, based on recurring themes and strongly held positions across all appearances. This is not a biography. It does not start with "[Name] is a..." It starts with what makes their perspective interesting or distinctive. Written in third person, grounded in their actual words, with inline episode references.
Example framing: "Across 14 appearances, [Name] consistently argues that most people's lives are constrained by fear of reversible decisions, and that systematic self-experimentation is undervalued relative to expert advice."
Organized by conviction strength, not topic frequency. Someone who passionately argues a contrarian view on one podcast is more interesting than someone who casually mentions AI on twelve. Each position includes the claim synthesis, the supporting quote (expandable), and the source episode.
Evolution tracking: When a person's position on a topic changes across appearances over time, both positions are shown with dates, creating a visible arc. Contradictions and shifts are flagged, not hidden. This is intended to be a core differentiator of the product.
At each conviction or theme, small cards surface other profiled people with relevant perspectives on the same topic. Three types of connection cards:
- "Also spoke extensively about this" — A lesser-known guest who went deep on the same specific topic. Shows their strongest quote on it.
- "Disagrees on this" — Someone who holds an opposing view on the specific topic. Both positions shown side by side.
- "Recommended by [Name]" — Asymmetric mentions. If this person name-drops and praises someone lesser-known, that's a strong discovery signal.
Discovery happens at the point of interest, not as a separate destination.
Books, tools, movies, music — not as flat lists, but clustered by what they reveal about how the person thinks. If someone recommends five stoic philosophy books and three systems thinking books, that pattern is the feature. Each recommendation cites its source episode. Deduplicated across appearances, with frequency noted when mentioned on multiple podcasts.
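The deduplication step described above could be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `Recommendation` shape, the `dedupeRecommendations` name, and the choice to key on the lowercased title are all assumptions made for the example.

```typescript
// Hypothetical shape: one recommendation as extracted from a single episode.
interface Recommendation {
  title: string;
  episodeId: string;
  podcast: string;
}

// Hypothetical shape: one recommendation after merging across appearances.
interface DedupedRecommendation {
  title: string;
  episodeIds: string[];  // every source episode, so each entry can cite its sources
  podcastCount: number;  // frequency: number of distinct podcasts it was mentioned on
}

function dedupeRecommendations(items: Recommendation[]): DedupedRecommendation[] {
  const byKey = new Map<
    string,
    { title: string; episodeIds: Set<string>; podcasts: Set<string> }
  >();

  for (const item of items) {
    // Assumption: titles that differ only by case or surrounding whitespace are the same item.
    const key = item.title.trim().toLowerCase();
    let entry = byKey.get(key);
    if (!entry) {
      entry = {
        title: item.title.trim(),
        episodeIds: new Set<string>(),
        podcasts: new Set<string>(),
      };
      byKey.set(key, entry);
    }
    entry.episodeIds.add(item.episodeId);
    entry.podcasts.add(item.podcast);
  }

  return Array.from(byKey.values()).map((entry) => ({
    title: entry.title,
    episodeIds: Array.from(entry.episodeIds),
    podcastCount: entry.podcasts.size,
  }));
}
```

Exact title matching will miss variants such as subtitles or misspellings, so a real implementation would likely need fuzzier matching before clustering.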
For each person, identify the 2–4 most specific topics where they have deep-dive depth across multiple appearances. These render as prominent badges near the header. For lesser-known guests, these badges are the primary identity signal — they tell a visitor immediately whether this person's perspective is relevant to them, regardless of name recognition.
Aggregated across all appearances with context and sentiment. When a mentioned person also has a PodGraph profile, link directly to their page.
Chronological list of episodes this person appeared on. Each entry shows podcast name, episode title, date, and the 2–3 strongest themes from that specific appearance. This serves as the index for finding original episodes — it provides enough context to locate and listen to any episode without needing a dedicated episode page.
- No network graph visualization. Connections are surfaced contextually inline, not as a separate graph view.
- No external biographical data. Everything comes from the person's own words in transcribed interviews.
- No episode pages. Episodes are data sources, not destinations. Episode-specific content is accessed through person pages and the Podcast Appearances section.
- No AI chat. Conversational AI interface is deferred to a future phase. The person page itself is the research tool.
Organized around themes rather than flat data-type arrays. Enables clean aggregation for person pages.
| Field | Description |
|---|---|
| name: string | Theme name (e.g., "CUDA as Strategic Foundation") |
| depth: enum | mentioned \| discussed \| deep_dive |
| description: string | 1–2 sentence teaser that creates curiosity, not a comprehensive summary |
| best_quote: object (optional) | Single best quote: text, context, timestamp_start, timestamp_end |
| opinion: object (optional) | Speaker's position + supporting quote (only when a clear position was taken) |
| anecdote: object (optional) | Personal story summary + timestamp (only when a specific story was shared) |
| related_people: string[] | Names referenced in this theme's discussion |
| related_companies: string[] | Companies referenced in this theme's discussion |
| Field | Description |
|---|---|
| role: string | host \| guest \| cohost \| unknown |
| self_description: array | Professional roles and titles only |
| themes: array | 8–10 theme objects organized by depth |
| books: array | Title, author, context, recommended boolean |
| movies_tv: array | Title, context, recommended boolean |
| music: array | Name, context, recommended boolean |
| tools_products: array | Name, context, relationship (created \| recommended \| used \| mentioned) |
| people_mentioned: array | Name, slug, context, sentiment |
| companies_orgs: array | Name, slug, context, sentiment |
- Companies separated from people in the extraction prompt. Without this, companies end up in `people_mentioned`.
- Relationship types on tools/products. Distinguishes created vs recommended vs used vs mentioned.
- Self-description constrained to roles. Professional titles only, not health conditions or personality traits.
- Quote coherence filtering. Skip quotes with repeated words, false starts, or fragments.
These database entities extend the extraction schema with persistence and relationships.
| Field | Description |
|---|---|
| id: UUID | Primary key |
| title: VARCHAR | Podcast name |
| rss_url: VARCHAR | RSS feed URL for automated ingestion |
| artwork_url: VARCHAR | Podcast cover art |
| host_person_id: UUID (FK) | Link to the host's Person record |
| description: TEXT | Podcast description from RSS |
| platform_links: JSONB | URLs to Spotify, Apple Podcasts, YouTube, etc. |
| episode_count: INT | Total episodes ingested |
Episodes exist as database records for pipeline processing and as source references on person pages. They do not have their own public-facing page.
| Field | Description |
|---|---|
| id: UUID | Primary key |
| podcast_id: UUID (FK) | Parent podcast |
| title: VARCHAR | Episode title |
| audio_url: VARCHAR | Direct audio URL |
| published_at: TIMESTAMP | Original publish date |
| duration_seconds: INT | Episode length |
| status: ENUM | pending \| transcribing \| correcting \| processing \| complete \| failed |
| transcript_raw: TEXT | Full verbatim transcript with speaker labels + timestamps |
| themes: JSONB | Array of extracted themes with nested quotes, opinions, anecdotes |
| guest_person_ids: UUID[] | Links to guest Person records |
| Field | Description |
|---|---|
| id: UUID | Primary key |
| name: VARCHAR | Full name (canonical) |
| slug: VARCHAR | URL-friendly identifier |
| aliases: VARCHAR[] | Alternative names and nicknames |
| roles: JSONB | Roles from self_description across all appearances |
| worldview_summary: TEXT | AI-generated narrative of beliefs and recurring themes (not a biography) |
| convictions: JSONB | Ranked positions with conviction strength, supporting quotes, episode refs, and evolution timestamps |
| deep_on_topics: VARCHAR[] | 2–4 most specific topics with deep-dive depth across multiple appearances |
| taste_clusters: JSONB | Recommendations clustered by pattern (e.g., stoic philosophy, systems thinking) |
| themes_aggregated: JSONB | Merged themes across all appearances with frequency |
| interests: JSONB | Books, movies, music, hobbies mentioned |
| people_mentioned: JSONB | People discussed with context and sentiment |
| episode_appearances: UUID[] | All episodes this person appears in |
| appearance_count: INT | Total episode appearances |
| Field | Description |
|---|---|
| id: UUID | Primary key |
| person_a_id / person_b_id: UUID (FK) | The two connected people |
| connection_type: ENUM | shared_interest \| mutual_mention \| same_episode \| shared_recommendation \| shared_topic \| opposing_position |
| strength: FLOAT | Connection strength score (0–1), normalized |
| details: JSONB | Specifics: which books, topics, quotes create this link |
| evidence_episodes: UUID[] | Episodes where this connection was detected |
| Field | Description |
|---|---|
| slug: VARCHAR | Primary key (e.g., "peter-attia") |
| entity_type: ENUM | person \| company |
| canonical_name: VARCHAR | Display name |
| aliases: VARCHAR[] | Alternative names/spellings |
| episodes: UUID[] | All episodes where this entity is mentioned |
The scoring engine powers inline contextual cards. Scoring prioritizes specificity and meaningful overlap over surface-level topic matches.
| Signal | Score |
|---|---|
| Co-appearance on same episode | +0.5 |
| Mutual mention (A mentions B AND B mentions A) | +0.4 |
| Shared specific book recommendation | +0.3 per book |
| Opposing positions on same specific topic | +0.3 per topic |
| Shared movie/music recommendation | +0.2 per item |
| One-way mention (especially praise of lesser-known person) | +0.15 |
| Shared specific topic (not generic) | +0.1 per topic |
| Final score | Normalized to 0–1 range |
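A minimal sketch of how the signal table above might translate into code. The weights come from the table. Everything else is an assumption made for illustration: the `ConnectionSignals` shape, the function names, the rule that a mutual mention replaces (not adds to) a one-way mention, and normalization by clamping at 1, since the roadmap only says the final score is normalized to 0–1.

```typescript
// Signal weights copied from the scoring table.
const WEIGHTS = {
  coAppearance: 0.5,
  mutualMention: 0.4,
  sharedBook: 0.3,          // per book
  opposingPosition: 0.3,    // per topic
  sharedMovieOrMusic: 0.2,  // per item
  oneWayMention: 0.15,
  sharedSpecificTopic: 0.1, // per topic
} as const;

// Hypothetical input: the signals detected for one pair of people.
interface ConnectionSignals {
  coAppearance: boolean;
  mutualMention: boolean;
  oneWayMention: boolean;
  sharedBooks: number;
  opposingPositions: number;
  sharedMoviesOrMusic: number;
  sharedSpecificTopics: number;
}

function rawConnectionScore(s: ConnectionSignals): number {
  // Assumption: a mutual mention supersedes a one-way mention; they are not summed.
  const mentionScore = s.mutualMention
    ? WEIGHTS.mutualMention
    : s.oneWayMention
      ? WEIGHTS.oneWayMention
      : 0;

  return (
    (s.coAppearance ? WEIGHTS.coAppearance : 0) +
    mentionScore +
    s.sharedBooks * WEIGHTS.sharedBook +
    s.opposingPositions * WEIGHTS.opposingPosition +
    s.sharedMoviesOrMusic * WEIGHTS.sharedMovieOrMusic +
    s.sharedSpecificTopics * WEIGHTS.sharedSpecificTopic
  );
}

// Assumption: "normalized" is implemented here as a simple clamp to [0, 1].
function connectionStrength(s: ConnectionSignals): number {
  return Math.min(1, Math.max(0, rawConnectionScore(s)));
}
```

With these weights, a one-way mention plus one shared book scores about 0.45, while a co-appearance with a mutual mention and two shared books exceeds 1 and is clamped. A clamp loses ordering among very strong connections, so a different normalization (for example, dividing by the maximum observed score) may be preferable in practice.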
Every episode goes through a five-stage processing pipeline. Each stage reads the saved output of the stage before it, so any stage can be re-run without re-running earlier stages.
Download audio (via yt-dlp for YouTube), send to Deepgram nova-3 with speaker diarization, utterance grouping, and smart formatting. Output: transcript with speaker labels, timestamps, and utterance text.
Apply global corrections file first (cheap, instant), then send transcript sample + episode metadata to Claude for remaining proper noun errors. Append new corrections to global file so the system improves automatically. Use whole-word, case-insensitive matching to avoid corrupting substrings.
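The whole-word, case-insensitive matching mentioned above could look roughly like this. It is a sketch under assumptions: the corrections file is treated as a flat map from wrong spelling to correct spelling, and `applyCorrections` and `escapeRegExp` are hypothetical names, not the prototype's actual functions.

```typescript
// Hypothetical shape of the global corrections file: wrong spelling -> correct spelling.
type Corrections = Record<string, string>;

// Escape regex metacharacters so each correction key is matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function applyCorrections(transcript: string, corrections: Corrections): string {
  let result = transcript;
  for (const [wrong, right] of Object.entries(corrections)) {
    // \b word boundaries stop "cuda" from matching inside "barracuda".
    // Flags: g = replace every occurrence, i = ignore case.
    const pattern = new RegExp(`\\b${escapeRegExp(wrong)}\\b`, "gi");
    // A replacer function avoids special handling of "$" sequences in the replacement text.
    result = result.replace(pattern, () => right);
  }
  return result;
}
```

Note that `\b` is defined in terms of ASCII word characters, so keys that begin or end with punctuation or non-ASCII letters may need a different boundary check.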
Send transcript excerpt + platform metadata to Claude. Map generic speaker labels to real names with confidence scores. Episode metadata (especially YouTube titles) is the strongest signal. Low-confidence identifications (< 0.7) are labeled "Unknown Speaker" rather than guessed.
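A small sketch of the confidence threshold described above. The 0.7 cutoff and the "Unknown Speaker" label come from the roadmap; the `SpeakerGuess` shape and the `resolveSpeakerNames` name are assumptions for illustration.

```typescript
// Hypothetical shape of one identification returned by the speaker-ID step.
interface SpeakerGuess {
  label: string;      // generic diarization label, e.g. "Speaker 0"
  name: string;       // the proposed real name
  confidence: number; // 0 to 1
}

const MIN_CONFIDENCE = 0.7;
const UNKNOWN_SPEAKER = "Unknown Speaker";

// Map each diarization label to a display name, refusing to guess below the threshold.
function resolveSpeakerNames(guesses: SpeakerGuess[]): Map<string, string> {
  const names = new Map<string, string>();
  for (const guess of guesses) {
    const name = guess.confidence < MIN_CONFIDENCE ? UNKNOWN_SPEAKER : guess.name;
    names.set(guess.label, name);
  }
  return names;
}
```

Identifications at exactly 0.7 are kept, matching the "< 0.7" wording above.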
Send full transcript to Claude with structured prompt requesting theme-based extraction. Validate response with Zod. Retry with schema feedback if validation fails. For transcripts exceeding 300K characters, split into overlapping chunks, extract independently, then merge with topic deduplication.
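The overlapping-chunk split described above might be sketched like this. The 300K-character limit is from the roadmap; the overlap size is not specified there, so the 5,000-character default below is purely an assumption, as is the `splitIntoChunks` name.

```typescript
const MAX_CHARS = 300000;   // limit stated in the roadmap
const OVERLAP_CHARS = 5000; // assumption: the roadmap does not give an overlap size

// Split text into windows of at most maxChars, where each window repeats the
// last `overlap` characters of the previous one.
function splitIntoChunks(
  text: string,
  maxChars: number = MAX_CHARS,
  overlap: number = OVERLAP_CHARS,
): string[] {
  if (overlap < 0 || overlap >= maxChars) {
    throw new Error("overlap must be non-negative and smaller than maxChars");
  }
  if (text.length <= maxChars) return [text];

  const chunks: string[] = [];
  const step = maxChars - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + maxChars));
    // Stop once this window has reached the end of the text.
    if (start + maxChars >= text.length) break;
  }
  return chunks;
}
```

This version cuts at raw character offsets, which can split an utterance in half. Splitting at utterance boundaries from the diarized transcript would probably produce cleaner extractions, and the merge step still needs topic deduplication because overlapping regions are extracted twice.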
Read people_mentioned and companies_orgs from extraction. For each entity: if new, add to registry; if existing, append episode reference and merge aliases.
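The add-or-merge rule above can be sketched as a single upsert function. The field names follow the EntityRegistry table, but the in-memory `Map`, the `Mention` shape, and the `upsertEntity` name are assumptions for illustration, not the actual `update-registry.ts` code.

```typescript
type EntityType = "person" | "company";

// Mirrors the EntityRegistry fields, in camelCase.
interface RegistryEntry {
  slug: string;
  entityType: EntityType;
  canonicalName: string;
  aliases: string[];
  episodes: string[];
}

// Hypothetical shape of one entity mention coming out of extraction.
interface Mention {
  slug: string;
  entityType: EntityType;
  name: string;
}

function upsertEntity(
  registry: Map<string, RegistryEntry>,
  mention: Mention,
  episodeId: string,
): void {
  const existing = registry.get(mention.slug);

  // New entity: create the entry with the mentioned name as the canonical name.
  if (!existing) {
    registry.set(mention.slug, {
      slug: mention.slug,
      entityType: mention.entityType,
      canonicalName: mention.name,
      aliases: [],
      episodes: [episodeId],
    });
    return;
  }

  // Existing entity: append the episode reference once.
  if (!existing.episodes.includes(episodeId)) {
    existing.episodes.push(episodeId);
  }

  // Merge aliases: record the name if it is not already known (case-insensitive).
  const known = [existing.canonicalName, ...existing.aliases].map((n) => n.toLowerCase());
  if (!known.includes(mention.name.toLowerCase())) {
    existing.aliases.push(mention.name);
  }
}
```

Because both checks are idempotent, re-running the registry stage for the same episode does not duplicate episode references or aliases.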
When a new episode is processed, the person page aggregation runs for each identified speaker. This is incremental — it updates existing person data rather than reprocessing everything.
For each theme with an opinion field, extract the position, strength of conviction (based on language intensity, repetition, and depth of argument), and the supporting quote. Compare against existing convictions for this person. If a new position contradicts an existing one, flag the evolution and preserve both with timestamps.
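One way the compare-and-flag step above might be structured. Deciding whether two positions actually contradict each other is left to an injected `contradicts` function (in practice this would presumably be a Claude call). The types, names, and the assumption that publish dates are ISO strings are all illustrative.

```typescript
// Hypothetical shape of one position taken in one episode.
interface Position {
  topic: string;
  claim: string;
  episodeId: string;
  publishedAt: string; // ISO date string, e.g. "2024-06-10"
}

// Hypothetical per-topic record stored on the person.
interface ConvictionRecord {
  topic: string;
  history: Position[]; // every position, oldest first
  evolved: boolean;    // true once any contradiction has been detected
}

type Contradicts = (earlier: Position, later: Position) => boolean;

function recordPosition(
  existing: ConvictionRecord | undefined,
  incoming: Position,
  contradicts: Contradicts,
): ConvictionRecord {
  const previous = existing?.history ?? [];

  // Preserve every position and order by publish date.
  // ISO date strings in the same format sort correctly as plain strings.
  const history = [...previous, incoming].sort((a, b) =>
    a.publishedAt.localeCompare(b.publishedAt),
  );

  // Flag evolution if the new position contradicts any earlier one. Never unset the flag.
  const evolved =
    (existing?.evolved ?? false) || previous.some((p) => contradicts(p, incoming));

  return { topic: incoming.topic, history, evolved };
}
```

Keeping the full history (not just the latest position) is what allows the person page to show both positions with dates.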
Using all convictions and deep-dive themes across appearances, generate the worldview summary narrative. Re-generated when significant new data arrives (a new episode that introduces new themes or changes positions).
Identify topics where the person has deep-dive depth across multiple appearances. Requires at least 2 episodes with depth: deep_dive on overlapping themes. Surface-level mentions across many episodes do not qualify.
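A minimal sketch of the qualification rule above: a topic earns a badge only if it reaches `deep_dive` depth in at least two distinct episodes. It assumes overlapping themes have already been mapped to a shared topic key, which the real pipeline would need to do semantically. The types and the `deepOnTopics` name are hypothetical.

```typescript
type Depth = "mentioned" | "discussed" | "deep_dive";

// Hypothetical shape: one theme from one episode, already mapped to a shared topic key.
interface ThemeAppearance {
  topic: string;
  depth: Depth;
  episodeId: string;
}

const MIN_DEEP_EPISODES = 2; // "at least 2 episodes" from the roadmap

function deepOnTopics(themes: ThemeAppearance[], maxBadges: number = 4): string[] {
  // Collect the distinct episodes in which each topic reached deep_dive depth.
  const episodesByTopic = new Map<string, Set<string>>();
  for (const theme of themes) {
    if (theme.depth !== "deep_dive") continue; // surface-level mentions never count
    const episodes = episodesByTopic.get(theme.topic) ?? new Set<string>();
    episodes.add(theme.episodeId);
    episodesByTopic.set(theme.topic, episodes);
  }

  return Array.from(episodesByTopic.entries())
    .filter(([, episodes]) => episodes.size >= MIN_DEEP_EPISODES)
    .sort((a, b) => b[1].size - a[1].size) // most deep-dive episodes first
    .slice(0, maxBadges)
    .map(([topic]) => topic);
}
```

Ranking by number of deep-dive episodes is one plausible tiebreaker; the roadmap asks for the "most specific" topics, which would need an additional specificity signal.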
Cluster recommendations (books, tools, movies, music) by pattern. Use Claude to identify thematic groupings rather than simple category matching. The clusters should reveal something about how the person thinks.
For each person, generate connection card data for inline display. Query the PersonConnection table for high-scoring connections, then format into the three card types: also-spoke-about, disagrees-on, and recommended-by. Prioritize lesser-known guests when the current person is well-known.
| Route | Page | Description |
|---|---|---|
| / | Home / Explore | Featured people, trending topics, search, category browsing |
| /person/[slug] | Person Profile | Full aggregated profile with inline contextual connections |
| /podcast/[slug] | Podcast Page | Podcast info, all processed episodes, host profile link |
| /explore/[category] | Category Browse | Browse by occupation, interest, hobby, etc. |
| /search | Search | Full-text search across people, topics, quotes |
- /episode/[id]: No episode pages. Episode data is accessed through person profiles and podcast pages.
- /person/[slug]/chat: AI chat is deferred to a future phase.
- /person/[slug]/connections: Connection data is embedded inline on the person profile page.
- /connections: No standalone connection visualization.
The podcast page shows podcast metadata (title, description, artwork, platform links) and a chronological list of all processed episodes with guest names and theme highlights. Each episode entry links to the guest's person profile. This is the closest thing to an episode index — it helps users find episodes, but the content lives on person pages.
Core extraction pipeline validated end-to-end with real episodes. Five pipeline stages working: transcription, correction, speaker identification, extraction, and entity registry. The prototype also included a static HTML page generator (build-page.ts) for validation purposes; this script is retired and not carried forward into the application.
Goal: Move from scripts to a real application with persistent storage.
| Task | Priority | Details |
|---|---|---|
| Next.js app scaffolding | P0 | Next.js 14, TypeScript, Tailwind, shadcn/ui, PostgreSQL with Prisma |
| Database schema | P0 | Core tables: Podcast, Episode, Person, PersonConnection, EntityRegistry |
| Migrate pipeline to app | P0 | Pipeline scripts become src/lib/pipeline/ and src/lib/ai/ modules |
| Manual episode ingestion | P0 | Admin form to submit episode URL, triggers pipeline |
| Basic person list page | P0 | Simple index of all profiled people with name, roles, appearance count |
| Basic search | P1 | Full-text search over Person names and themes |
Goal: Build the person page as defined in Section 3. This is the primary focus of the roadmap.
| Task | Priority | Details |
|---|---|---|
| Person page aggregation pipeline | P0 | Implement the 5-step aggregation described in Section 5.2 |
| Worldview summary generation | P0 | Claude-generated narrative of beliefs and recurring themes |
| Conviction extraction & ranking | P0 | Extract positions, rank by strength, detect evolution over time |
| Person profile page (React) | P0 | Full page layout with all sections from Section 3.1 |
| Deep-on badge identification | P0 | Identify and render 2–4 deep-dive topics per person |
| Taste clustering | P1 | Cluster recommendations by pattern using Claude |
| Contextual connection cards | P1 | Inline connection cards: also-spoke-about, disagrees-on, recommended-by |
Goal: Add browsing, category exploration, and frontend polish.
| Task | Priority | Details |
|---|---|---|
| Category browsing pages | P0 | Browse by occupation, interest, hobby, recommended books |
| Home / Explore page | P0 | Featured people, trending topics, recent episodes |
| Podcast page | P1 | Podcast info, all processed episodes, guest profile links |
| Responsive design | P1 | Full mobile responsiveness across all pages |
| Search enhancements | P1 | Search within quotes, filter by topic, date range, person |
Goal: Automate ingestion and prepare for growth.
| Task | Priority | Details |
|---|---|---|
| RSS feed auto-ingestion | P0 | Scheduled jobs to check RSS feeds and process new episodes |
| Admin dashboard | P0 | Manage podcasts, monitor pipeline status, review flagged IDs |
| Auth + user accounts | P1 | Registration, saved favorites, custom collections |
| Performance optimization | P1 | Caching, ISR for profile pages, lazy loading, pagination |
Every piece of information on a person's profile must originate from that person's own words during a transcribed podcast interview. No Wikipedia, no LinkedIn, no external APIs. The worldview summary is synthesized from what the person has said. Roles come from how the person introduces themselves.
The person page is comprehensive, opinionated about what matters (conviction-ranked, not alphabetical), and reveals the person's intellectual arc over time. The litmus test: after reading the person page, does the reader feel they understand this person's perspective? If not, the aggregation is too shallow.
Episodes are ingested and processed but do not have their own public-facing route. All extracted content is surfaced through person pages. This keeps the product focused: the value is in the person-level synthesis, not in episode-level summaries. Episode metadata appears in two places: the Podcast Appearances section of person profiles, and the episode list on podcast pages.
Connections are not a destination — they are a discovery mechanism embedded at the point of interest. The most valuable moment to suggest a lesser-known guest is when the reader is already engaged with a topic that guest spoke deeply about.
Incorrect speaker attribution corrupts the entire dataset. The pipeline is conservative: uncertain identifications are flagged rather than guessed. Episode metadata (title typically names the guest) is the strongest signal.
Two people both mentioning "coffee" is not a connection. Two people both recommending the same specific book is. Connection strength is weighted by specificity. Opposing views on the same specific topic score the same as a shared book recommendation (+0.3 each).
podgraph/
  scripts/
    transcribe.ts
    correct-transcript.ts
    identify-speakers.ts
    extract.ts
    update-registry.ts
  prompts/
    extraction.txt
    speaker-id.txt
    correct-names.txt
  lib/
    schemas.ts
  data/
    transcript.json
    transcript-corrected.json
    transcript-identified.json
    extraction.json
    corrections-global.json
    entities.json
podgraph/
  prisma/
    schema.prisma
  src/
    app/
      person/[slug]/page.tsx
      podcast/[slug]/page.tsx
      explore/
      search/page.tsx
      admin/
    components/
      ui/
      person/
      layout/
    lib/
      db.ts
      ai/
        extraction.ts
        summarization.ts
        correction.ts
        aggregation.ts
      pipeline/
        ingestion.ts
        transcription.ts
        speaker-id.ts
        aggregation.ts
        connections.ts
        registry.ts
    workers/
      transcription.worker.ts
      extraction.worker.ts
      aggregation.worker.ts
- src/lib/ai/aggregation.ts: Worldview synthesis, conviction ranking, taste clustering
- src/lib/pipeline/connections.ts: Connection scoring engine (powers inline cards)
- src/workers/aggregation.worker.ts: Background job for person page aggregation after new episode processing
- scripts/build-page.ts: Static HTML episode page generator. No longer needed.
- output/: Generated HTML directory. No longer needed.
- AI chat / RAG system — Theme-aware transcript chunking, vector embeddings, and conversational AI interface are deferred to a future phase. No TranscriptChunk table, no pgvector, no embeddings infrastructure in the current scope.
| Variable | Purpose |
|---|---|
| DATABASE_URL | PostgreSQL connection string |
| REDIS_URL | Redis connection for BullMQ job queue |
| ANTHROPIC_API_KEY | Claude API for extraction, correction, summarization, and aggregation |
| ANTHROPIC_EXTRACT_MODEL | Model override for extraction (defaults to claude-sonnet-4-5) |
| DEEPGRAM_API_KEY | Deepgram API for transcription (nova-3) |
| S3_BUCKET / S3_ACCESS_KEY / S3_SECRET_KEY | S3-compatible storage for audio and transcripts |
| NEXTAUTH_SECRET / NEXTAUTH_URL | Authentication configuration |
| YT_DLP_PATH | Optional: explicit path to yt-dlp binary |
- TypeScript strictly throughout. No `any` types.
- Prisma for all database operations.
- BullMQ for all background jobs. Workers run in separate processes.
- Server components by default in Next.js. Client components only when interactivity is needed.
- All AI prompts stored as template strings in dedicated files under `prompts/` for easy iteration.
- Shared Zod schemas in `lib/schemas.ts`. Never duplicate schema definitions across files.
- Conviction strength scoring must be explainable. Use language intensity, repetition across episodes, and depth of supporting argument as signals. Store the reasoning, not just the score.
- Evolution detection requires temporal awareness. Compare positions across episodes ordered by publish date. Flag when the same topic appears with a different position.
- Worldview summary must not read like a Wikipedia intro. Prompt must enforce: start with what makes this person's perspective distinctive, not with their job title.
- Taste clusters require semantic grouping. Use Claude to identify thematic patterns across recommendations, not simple category matching.
- Connection cards must prioritize lesser-known guests. When rendering inline connections on a well-known person's page, weight toward guests with fewer appearances. The whole point is discovery.
- Theme-based organization: Extract themes as the primary unit, with quotes/opinions/anecdotes nested within them. Maximum 8–10 themes per speaker.
- Quote quality filtering: Skip quotes with repeated words, false starts, or fragments.
- Entity separation: Output `people_mentioned` and `companies_orgs` as distinct arrays.
- Zod validation with retry: Use `safeParse()` on Claude's response. If validation fails, send a follow-up with the Zod error asking for correction.
- Global corrections file: Append new corrections after each transcript correction run so common errors are fixed before Claude is called on future episodes.
- Unit tests for extraction parsing and Zod validation.
- Integration tests for the full pipeline (ingestion → transcription → correction → identification → extraction → registry).
- Person page tests for aggregation pipeline: verify conviction ranking, evolution detection, and taste clustering produce sensible results across different data shapes.
- Litmus test: After reading the person page, does the reader feel they understand this person's perspective? If not, the aggregation is too shallow.
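The "Zod validation with retry" note above could be sketched as the loop below. To keep the example dependency-free, `SchemaLike` is a local stand-in shaped like the result of Zod's `safeParse()`, so a real Zod schema should be usable in its place. `AskModel`, `extractWithRetry`, the retry limit of 3, and the follow-up prompt wording are all assumptions, not the prototype's actual code.

```typescript
// Stand-in for the part of a Zod schema this sketch relies on.
type ParseResult<T> =
  | { success: true; data: T }
  | { success: false; error: { message: string } };

interface SchemaLike<T> {
  safeParse(input: unknown): ParseResult<T>;
}

// Hypothetical wrapper around the Claude call: takes a prompt, returns the raw text response.
type AskModel = (prompt: string) => Promise<string>;

async function extractWithRetry<T>(
  schema: SchemaLike<T>,
  askModel: AskModel,
  prompt: string,
  maxAttempts: number = 3, // assumption: the roadmap does not specify a retry limit
): Promise<T> {
  let currentPrompt = prompt;
  let lastError = "no attempts made";

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await askModel(currentPrompt);

    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch {
      lastError = "response was not valid JSON";
      currentPrompt = `${prompt}\n\nYour previous response was not valid JSON. Return only JSON.`;
      continue;
    }

    const result = schema.safeParse(parsed);
    if (result.success) return result.data;

    // Feed the validation error back so the model can correct its output.
    lastError = result.error.message;
    currentPrompt =
      `${prompt}\n\nYour previous response failed schema validation:\n` +
      `${lastError}\nReturn corrected JSON only.`;
  }

  throw new Error(`Extraction failed after ${maxAttempts} attempts: ${lastError}`);
}
```

Capping the attempts keeps a persistently malformed response from looping forever; a failed extraction can then move the episode to the `failed` status for review.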
End of Roadmap — Ready for Implementation