PROJECT ROADMAP — PodGraph

The Podcast Knowledge Graph

March 2026 — Revised: Person Page MVP Focus


1. Product Vision

1.1 Problem Statement

People appear on dozens of podcasts, sharing ideas, recommendations, stories, and opinions. This content is scattered, unsearchable, and ephemeral. There is no way to get a holistic view of a person through their own words across all their podcast appearances.

1.2 Solution

PodGraph ingests, transcribes, and analyzes podcast interviews to build rich, AI-generated profiles of people. Every profile is constructed exclusively from the person's own words. The platform surfaces connections between people through shared ideas, mutual references, and common topics.

1.3 Core Value Propositions

  • Person-centric profiles: Everything a person has said across all transcribed podcast interviews, organized into a single page.
  • First-party data only: All information comes directly from the person's own words. No Wikipedia, no third-party bios.
  • Contextual connection discovery: Discover lesser-known guests through shared specific interests, recommendations, and opposing viewpoints — surfaced inline at the point of interest.
  • Browsable categories: Explore people through occupation, interests, hobbies, recommended books, and more.

1.4 Key Design Philosophy

The Person Page Is the Product

The person page is the single most important surface in PodGraph. It is a comprehensive research tool where aggregated opinions, detailed anecdotes, cross-episode synthesis, and conviction-ranked positions live. The product's value is measured by how well this page answers: who is this person, what do they believe, and how has their thinking evolved?

There are no episode pages. Episodes are internal data units — they are ingested, transcribed, and extracted from, but they do not have their own public-facing route. All extracted content is surfaced through person pages, where it gains meaning through aggregation and cross-referencing. Individual episode data (podcast name, title, date, theme highlights) appears in the Podcast Appearances section of each person's profile, providing enough context to identify and find the original episode.

Connections as Contextual Discovery, Not Graph Visualization

Connections are surfaced contextually: inline on person pages, at the exact point where a reader's interest is highest. The connection scoring engine powers these contextual cards, but there is no standalone graph destination. The working assumption is that network graph views draw little engagement and lead to little real discovery, while contextual surfacing at the point of interest is more effective.


2. System Architecture

2.1 Proven Pipeline Architecture

The extraction pipeline has been validated through prototype development. Five scripts run in sequence, each reading the previous script's output. Any script can be re-run independently.

| Step | Script | Details |
| --- | --- | --- |
| 1. Transcription | transcribe.ts | Deepgram nova-3, speaker diarization |
| 2. Transcript Correction | correct-transcript.ts | Claude-powered proper noun correction |
| 3. Speaker Identification | identify-speakers.ts | Claude maps speaker labels to real names |
| 4. Extraction | extract.ts | Structured theme-based extraction with Zod validation |
| 5. Entity Registry | update-registry.ts | Merges entities into global registry |

2.2 Tech Stack

| Layer | Technology | Rationale |
| --- | --- | --- |
| Frontend | Next.js 14+ (App Router) | SSR/SSG for SEO, React Server Components |
| Styling | Tailwind CSS + shadcn/ui | Rapid development, consistent design system |
| Backend API | Next.js API Routes + tRPC | Type-safe, co-located with frontend |
| Database | PostgreSQL (Supabase/Neon) | Relational + JSONB + full-text search |
| Job Queue | BullMQ + Redis | Background job processing for pipelines |
| Transcription | Deepgram API (nova-3) | High accuracy, native diarization |
| AI Extraction | Anthropic Claude API (Sonnet) | Structured extraction, summarization, aggregation |
| File Storage | S3-compatible (AWS/R2) | Audio files, transcript archives |
| Deployment | Vercel + Railway/Fly.io | Serverless frontend, persistent workers |

3. Person Page Design

The person page is the product's primary surface. It must feel like a coherent narrative about a person, not a database dump organized into tabs. Every section earns its place by answering: does this help someone understand who this person is and what they believe?

3.1 Page Structure

Header

Name, aggregated self-described roles, total appearances, date range of appearances. Roles come only from how the person introduces themselves across episodes — never external sources.

Worldview Summary (Lead Section)

A 2–3 paragraph narrative synthesizing what this person cares about and believes, based on recurring themes and strongly held positions across all appearances. This is not a biography. It does not start with "[Name] is a..." It starts with what makes their perspective interesting or distinctive. Written in third person, grounded in their actual words, with inline episode references.

Example framing: "Across 14 appearances, [Name] consistently argues that most people's lives are constrained by fear of reversible decisions, and that systematic self-experimentation is undervalued relative to expert advice."

Convictions & Positions

Organized by conviction strength, not topic frequency. Someone who passionately argues a contrarian view on one podcast is more interesting than someone who casually mentions AI on twelve. Each position includes the claim synthesis, the supporting quote (expandable), and the source episode.

Evolution tracking: When a person's position on a topic changes across appearances over time, both positions are shown with dates, creating a visible arc. Contradictions and shifts are flagged, not hidden. Tracking these shifts is intended as a key differentiator for the person page.

Contextual Connections (Inline)

At each conviction or theme, small cards surface other profiled people with relevant perspectives on the same topic. Three types of connection cards:

  • "Also spoke extensively about this" — A lesser-known guest who went deep on the same specific topic. Shows their strongest quote on it.
  • "Disagrees on this" — Someone who holds an opposing view on the specific topic. Both positions shown side by side.
  • "Recommended by [Name]" — Asymmetric mentions. If this person name-drops and praises someone lesser-known, that's a strong discovery signal.

Discovery happens at the point of interest, not as a separate destination.

Taste Profile (Recommendations)

Books, tools, movies, music — not as flat lists, but clustered by what they reveal about how the person thinks. If someone recommends five stoic philosophy books and three systems thinking books, that pattern is the feature. Each recommendation cites its source episode. Deduplicated across appearances, with frequency noted when mentioned on multiple podcasts.

"Deep On" Badges

For each person, identify the 2–4 most specific topics where they have deep-dive depth across multiple appearances. These render as prominent badges near the header. For lesser-known guests, these badges are the primary identity signal — they tell a visitor immediately whether this person's perspective is relevant to them, regardless of name recognition.

People They Mention

Aggregated across all appearances with context and sentiment. When a mentioned person also has a PodGraph profile, link directly to their page.

Podcast Appearances

Chronological list of episodes this person appeared on. Each entry shows podcast name, episode title, date, and the 2–3 strongest themes from that specific appearance. This serves as the index for finding original episodes — it provides enough context to locate and listen to any episode without needing a dedicated episode page.

3.2 What Is NOT on the Person Page

  • No network graph visualization. Connections are surfaced contextually inline, not as a separate graph view.
  • No external biographical data. Everything comes from the person's own words in transcribed interviews.
  • No episode pages. Episodes are data sources, not destinations. Episode-specific content is accessed through the Podcast Appearances section of each person page.
  • No AI chat. Conversational AI interface is deferred to a future phase. The person page itself is the research tool.

4. Data Models

4.1 Extraction Schema (Validated in Prototype)

Organized around themes rather than flat data-type arrays. Enables clean aggregation for person pages.

Theme (Primary Organizing Unit)

| Field | Type | Description |
| --- | --- | --- |
| name | string | Theme name (e.g., "CUDA as Strategic Foundation") |
| depth | enum | One of: mentioned, discussed, deep_dive |
| description | string | 1–2 sentence teaser that creates curiosity, not a comprehensive summary |
| best_quote | object (optional) | Single best quote: text, context, timestamp_start, timestamp_end |
| opinion | object (optional) | Speaker's position + supporting quote (only when a clear position was taken) |
| anecdote | object (optional) | Personal story summary + timestamp (only when a specific story was shared) |
| related_people | string[] | Names referenced in this theme's discussion |
| related_companies | string[] | Companies referenced in this theme's discussion |

Speaker Data (Per-Speaker Extraction)

| Field | Type | Description |
| --- | --- | --- |
| role | string | One of: host, guest, cohost, unknown |
| self_description | array | Professional roles and titles only |
| themes | array | 8–10 theme objects organized by depth |
| books | array | Title, author, context, recommended boolean |
| movies_tv | array | Title, context, recommended boolean |
| music | array | Name, context, recommended boolean |
| tools_products | array | Name, context, relationship (created, recommended, used, or mentioned) |
| people_mentioned | array | Name, slug, context, sentiment |
| companies_orgs | array | Name, slug, context, sentiment |
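
To make the shape of the two tables above concrete, here is a minimal TypeScript sketch of the theme type, plus a helper that orders themes by depth. The nested field names inside `opinion` and `anecdote`, the timestamp unit, and the helper `sortThemesByDepth` are illustrative assumptions; the prototype's actual Zod schemas in lib/schemas.ts may differ.

```typescript
// Depth levels from the Theme table.
type Depth = "mentioned" | "discussed" | "deep_dive";

interface Quote {
  text: string;
  context: string;
  timestamp_start: number; // assumed to be seconds; the roadmap does not state the unit
  timestamp_end: number;
}

interface Theme {
  name: string;
  depth: Depth;
  description: string;
  best_quote?: Quote;
  opinion?: { position: string; supporting_quote: string }; // nested names are assumed
  anecdote?: { summary: string; timestamp: number }; // nested names are assumed
  related_people: string[];
  related_companies: string[];
}

// Lower rank sorts first, so deep_dive themes lead the list.
const DEPTH_RANK: Record<Depth, number> = { deep_dive: 0, discussed: 1, mentioned: 2 };

// Hypothetical helper: returns a new array ordered deep_dive, discussed, mentioned.
// The input array is not modified.
function sortThemesByDepth(themes: Theme[]): Theme[] {
  return [...themes].sort((a, b) => DEPTH_RANK[a.depth] - DEPTH_RANK[b.depth]);
}
```

Keeping optional fields truly optional (rather than empty strings) mirrors the rule that an opinion or anecdote is recorded only when one was actually expressed.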

Key Schema Decisions from Prototype Testing

  • Companies separated from people in the extraction prompt. Without this, companies end up in people_mentioned.
  • Relationship types on tools/products. Distinguishes created vs recommended vs used vs mentioned.
  • Self-description constrained to roles. Professional titles only, not health conditions or personality traits.
  • Quote coherence filtering. Skip quotes with repeated words, false starts, or fragments.
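
The quote coherence rule in the last bullet can be approximated with simple text heuristics. The sketch below is one possible reading of "repeated words, false starts, or fragments"; the function name `isCoherentQuote`, the minimum word count, and the specific patterns are assumptions for illustration, not the prototype's actual filter.

```typescript
// Assumed threshold: quotes shorter than this are treated as fragments.
const MIN_WORDS = 6;

// Hypothetical heuristic filter. Returns true when a quote looks usable.
function isCoherentQuote(text: string): boolean {
  const trimmed = text.trim();
  const words = trimmed.toLowerCase().split(/\s+/).filter((w) => w.length > 0);

  // Fragment check: too short to stand alone.
  if (words.length < MIN_WORDS) return false;

  // Repeated-word check: the same word twice in a row ("I I", "the the").
  for (let i = 1; i < words.length; i++) {
    const prev = words[i - 1].replace(/[^a-z0-9']/g, "");
    const curr = words[i].replace(/[^a-z0-9']/g, "");
    if (prev.length > 0 && prev === curr) return false;
  }

  // False-start check: a double hyphen, long dash, or ellipsis inside the quote.
  if (/(--|\u2014|\.\.\.|\u2026)/.test(trimmed)) return false;

  // Fragment check: the quote must end with sentence punctuation.
  return /[.!?]["')\]]?$/.test(trimmed);
}
```

A heuristic like this will reject some legitimate quotes (for example, a grammatical "that that"), so it is best treated as a first-pass filter ahead of the extraction prompt's own judgment.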

4.2 Database Entities

These database entities extend the extraction schema with persistence and relationships.

Podcast

| Field | Type | Description |
| --- | --- | --- |
| id | UUID | Primary key |
| title | VARCHAR | Podcast name |
| rss_url | VARCHAR | RSS feed URL for automated ingestion |
| artwork_url | VARCHAR | Podcast cover art |
| host_person_id | UUID (FK) | Link to the host's Person record |
| description | TEXT | Podcast description from RSS |
| platform_links | JSONB | URLs to Spotify, Apple Podcasts, YouTube, etc. |
| episode_count | INT | Total episodes ingested |

Episode (Internal Data Entity — No Public Route)

Episodes exist as database records for pipeline processing and as source references on person pages. They do not have their own public-facing page.

| Field | Type | Description |
| --- | --- | --- |
| id | UUID | Primary key |
| podcast_id | UUID (FK) | Parent podcast |
| title | VARCHAR | Episode title |
| audio_url | VARCHAR | Direct audio URL |
| published_at | TIMESTAMP | Original publish date |
| duration_seconds | INT | Episode length |
| status | ENUM | One of: pending, transcribing, correcting, processing, complete, failed |
| transcript_raw | TEXT | Full verbatim transcript with speaker labels + timestamps |
| themes | JSONB | Array of extracted themes with nested quotes, opinions, anecdotes |
| guest_person_ids | UUID[] | Links to guest Person records |

Person

| Field | Type | Description |
| --- | --- | --- |
| id | UUID | Primary key |
| name | VARCHAR | Full name (canonical) |
| slug | VARCHAR | URL-friendly identifier |
| aliases | VARCHAR[] | Alternative names and nicknames |
| roles | JSONB | Roles from self_description across all appearances |
| worldview_summary | TEXT | AI-generated narrative of beliefs and recurring themes; not a biography |
| convictions | JSONB | Ranked positions with conviction strength, supporting quotes, episode refs, and evolution timestamps |
| deep_on_topics | VARCHAR[] | 2–4 most specific topics with deep-dive depth across multiple appearances |
| taste_clusters | JSONB | Recommendations clustered by pattern (e.g., stoic philosophy, systems thinking) |
| themes_aggregated | JSONB | Merged themes across all appearances with frequency |
| interests | JSONB | Books, movies, music, hobbies mentioned |
| people_mentioned | JSONB | People discussed with context and sentiment |
| episode_appearances | UUID[] | All episodes this person appears in |
| appearance_count | INT | Total episode appearances |

PersonConnection

| Field | Type | Description |
| --- | --- | --- |
| id | UUID | Primary key |
| person_a_id / person_b_id | UUID (FK) | The two connected people |
| connection_type | ENUM | One of: shared_interest, mutual_mention, same_episode, shared_recommendation, shared_topic, opposing_position |
| strength | FLOAT | Connection strength score (0–1), normalized |
| details | JSONB | Specifics: which books, topics, quotes create this link |
| evidence_episodes | UUID[] | Episodes where this connection was detected |

EntityRegistry

| Field | Type | Description |
| --- | --- | --- |
| slug | VARCHAR | Primary key (e.g., "peter-attia") |
| entity_type | ENUM | One of: person, company |
| canonical_name | VARCHAR | Display name |
| aliases | VARCHAR[] | Alternative names/spellings |
| episodes | UUID[] | All episodes where this entity is mentioned |

4.3 Connection Scoring

The scoring engine powers inline contextual cards. Scoring prioritizes specificity and meaningful overlap over surface-level topic matches.

| Signal | Score |
| --- | --- |
| Co-appearance on same episode | +0.5 |
| Mutual mention (A mentions B AND B mentions A) | +0.4 |
| Shared specific book recommendation | +0.3 per book |
| Opposing positions on same specific topic | +0.3 per topic |
| Shared movie/music recommendation | +0.2 per item |
| One-way mention (especially praise of lesser-known person) | +0.15 |
| Shared specific topic (not generic) | +0.1 per topic |

Final score: normalized to the 0–1 range.
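
The weights above translate directly into a small scoring function. The following sketch sums the signals and then normalizes. The roadmap does not say how normalization is done, so clamping the raw sum at 1 is an assumption here, as are the `ConnectionSignals` shape and the function names.

```typescript
// Hypothetical input shape: one record per pair of people.
interface ConnectionSignals {
  coAppearance: boolean;        // appeared on the same episode
  mutualMention: boolean;       // A mentions B and B mentions A
  oneWayMention: boolean;       // set only when the mention is not mutual
  sharedBooks: number;          // count of shared specific book recommendations
  opposingTopics: number;       // count of specific topics with opposing positions
  sharedMoviesOrMusic: number;  // count of shared movie/music recommendations
  sharedSpecificTopics: number; // count of shared specific (non-generic) topics
}

// Sum of weighted signals, using the weights from the table.
function rawScore(s: ConnectionSignals): number {
  return (
    (s.coAppearance ? 0.5 : 0) +
    (s.mutualMention ? 0.4 : 0) +
    (s.oneWayMention ? 0.15 : 0) +
    0.3 * s.sharedBooks +
    0.3 * s.opposingTopics +
    0.2 * s.sharedMoviesOrMusic +
    0.1 * s.sharedSpecificTopics
  );
}

// Assumption: normalize by clamping to [0, 1]. Other schemes (for example,
// dividing by the maximum score in the dataset) would also satisfy the table.
function connectionStrength(s: ConnectionSignals): number {
  return Math.min(1, Math.max(0, rawScore(s)));
}
```

Clamping is the simplest option but saturates quickly: any pair that shares an episode and two books already scores 1. If finer ranking among strong connections matters, a relative normalization may be a better fit.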

5. AI Processing Pipeline

5.1 Episode Processing Pipeline

Every episode goes through a five-stage processing pipeline. Each stage is independent and can be re-run without re-running earlier stages.

Stage 1: Audio Ingestion & Transcription

Download audio (via yt-dlp for YouTube), send to Deepgram nova-3 with speaker diarization, utterance grouping, and smart formatting. Output: transcript with speaker labels, timestamps, and utterance text.

Stage 2: Automated Transcript Correction

Apply global corrections file first (cheap, instant), then send transcript sample + episode metadata to Claude for remaining proper noun errors. Append new corrections to global file so the system improves automatically. Use whole-word, case-insensitive matching to avoid corrupting substrings.
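
Whole-word, case-insensitive replacement can be done with a regular expression built from each entry in the corrections file. This is a minimal sketch; the `Corrections` shape (wrong term mapped to corrected term), the function names, and the sample misspelling are assumptions about how the file might be laid out.

```typescript
// Hypothetical corrections format: mis-transcribed term -> corrected term.
type Corrections = Record<string, string>;

// Escape characters that have special meaning inside a regular expression.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function applyCorrections(text: string, corrections: Corrections): string {
  let result = text;
  for (const wrong of Object.keys(corrections)) {
    const right = corrections[wrong];
    // \b anchors restrict the match to whole words, so substrings of longer
    // words are left alone. "g" replaces every occurrence, "i" ignores case.
    const pattern = new RegExp("\\b" + escapeRegExp(wrong) + "\\b", "gi");
    // A function replacement avoids "$" sequences in `right` being interpreted.
    result = result.replace(pattern, () => right);
  }
  return result;
}
```

For example, with the made-up entry `{ kuda: "CUDA" }`, "I use kuda daily" becomes "I use CUDA daily" while "barrakuda" is untouched. Note that `\b` only behaves as expected when the term starts and ends with a letter, digit, or underscore.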

Stage 3: Speaker Identification

Send transcript excerpt + platform metadata to Claude. Map generic speaker labels to real names with confidence scores. Episode metadata (especially YouTube titles) is the strongest signal. Low-confidence identifications (< 0.7) are labeled "Unknown Speaker" rather than guessed.
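
The confidence rule can be sketched as a small post-processing step over the model's guesses. The `SpeakerGuess` shape and the function name below are assumptions for illustration; only the 0.7 threshold and the "Unknown Speaker" label come from the text above.

```typescript
// Hypothetical shape of one identification returned by the model.
interface SpeakerGuess {
  label: string;      // diarization label, e.g. "Speaker 0"
  name: string;       // name proposed by the model
  confidence: number; // 0 to 1
}

const CONFIDENCE_THRESHOLD = 0.7;
const UNKNOWN_SPEAKER = "Unknown Speaker";

// Maps each diarization label to a display name. Guesses below the
// threshold are replaced with "Unknown Speaker" rather than trusted.
function resolveSpeakers(guesses: SpeakerGuess[]): Record<string, string> {
  const resolved: Record<string, string> = {};
  for (const g of guesses) {
    resolved[g.label] = g.confidence < CONFIDENCE_THRESHOLD ? UNKNOWN_SPEAKER : g.name;
  }
  return resolved;
}
```

A guess at exactly 0.7 is accepted here, matching the "< 0.7" wording.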

Stage 4: AI Extraction

Send full transcript to Claude with structured prompt requesting theme-based extraction. Validate response with Zod. Retry with schema feedback if validation fails. For transcripts exceeding 300K characters, split into overlapping chunks, extract independently, then merge with topic deduplication.
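
The chunking step for long transcripts might look like the sketch below. The 300K character limit is from the text; the overlap size and the function name are assumptions, and a production version would probably cut at utterance boundaries rather than at raw character offsets.

```typescript
const MAX_CHARS = 300000;
const OVERLAP_CHARS = 5000; // assumed; the roadmap does not give an overlap size

// Splits text into chunks of at most maxChars, where each chunk after the
// first starts `overlap` characters before the previous chunk ended.
function splitWithOverlap(
  text: string,
  maxChars: number = MAX_CHARS,
  overlap: number = OVERLAP_CHARS
): string[] {
  if (overlap < 0 || overlap >= maxChars) {
    throw new Error("overlap must be non-negative and smaller than maxChars");
  }
  if (text.length <= maxChars) return [text];

  const chunks: string[] = [];
  const step = maxChars - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + maxChars));
    // Stop once a chunk has reached the end of the text.
    if (start + maxChars >= text.length) break;
  }
  return chunks;
}
```

Because chunks overlap, the same theme can be extracted twice near a boundary, which is why the merge step needs deduplication.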

Stage 5: Entity Registry Update

Read people_mentioned and companies_orgs from extraction. For each entity: if new, add to registry; if existing, append episode reference and merge aliases.
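
The add-or-merge rule can be sketched as an upsert keyed by slug. The field names follow the EntityRegistry table; the in-memory `Registry` object, the `Mention` shape, and the choice to treat a differing name as a new alias are assumptions made for illustration.

```typescript
type EntityType = "person" | "company";

interface RegistryEntry {
  slug: string;
  entity_type: EntityType;
  canonical_name: string;
  aliases: string[];
  episodes: string[]; // episode IDs
}

// Hypothetical shape of one entry from people_mentioned or companies_orgs.
interface Mention {
  slug: string;
  name: string;
  entity_type: EntityType;
}

type Registry = Record<string, RegistryEntry>;

function upsertEntity(registry: Registry, mention: Mention, episodeId: string): void {
  const existing: RegistryEntry | undefined = registry[mention.slug];

  // New entity: create the entry with this episode as its first reference.
  if (existing === undefined) {
    registry[mention.slug] = {
      slug: mention.slug,
      entity_type: mention.entity_type,
      canonical_name: mention.name,
      aliases: [],
      episodes: [episodeId],
    };
    return;
  }

  // Existing entity: append the episode reference once.
  if (existing.episodes.indexOf(episodeId) === -1) {
    existing.episodes.push(episodeId);
  }
  // Assumption: a name that differs from the canonical one is kept as an alias.
  if (mention.name !== existing.canonical_name && existing.aliases.indexOf(mention.name) === -1) {
    existing.aliases.push(mention.name);
  }
}
```

Making the update idempotent (no duplicate episodes or aliases) is what allows the registry stage to be re-run safely on the same extraction output.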

5.2 Person Page Aggregation Pipeline

When a new episode is processed, the person page aggregation runs for each identified speaker. This is incremental — it updates existing person data rather than reprocessing everything.

Step 1: Conviction Extraction

For each theme with an opinion field, extract the position, strength of conviction (based on language intensity, repetition, and depth of argument), and the supporting quote. Compare against existing convictions for this person. If a new position contradicts an existing one, flag the evolution and preserve both with timestamps.
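
Once positions have been reduced to comparable labels, evolution detection is mostly a matter of ordering by date and comparing neighbors. The sketch below assumes an upstream step (likely Claude) has already assigned each conviction a normalized `topic` key and a short `stance` label; both fields and the function name are hypothetical.

```typescript
interface Conviction {
  topic: string;       // normalized topic key (assumed to be produced upstream)
  stance: string;      // normalized stance label (assumed to be produced upstream)
  episodeId: string;
  publishedAt: string; // ISO 8601 date of the source episode
}

interface Evolution {
  topic: string;
  from: Conviction; // earlier position, preserved
  to: Conviction;   // later position
}

// Orders convictions by publish date and records every point where the
// stance on a topic differs from the most recent earlier stance.
function detectEvolution(convictions: Conviction[]): Evolution[] {
  const ordered = [...convictions].sort(
    (a, b) => Date.parse(a.publishedAt) - Date.parse(b.publishedAt)
  );
  const latestByTopic: Record<string, Conviction | undefined> = {};
  const shifts: Evolution[] = [];

  for (const c of ordered) {
    const prev = latestByTopic[c.topic];
    if (prev !== undefined && prev.stance !== c.stance) {
      shifts.push({ topic: c.topic, from: prev, to: c });
    }
    latestByTopic[c.topic] = c;
  }
  return shifts;
}
```

The hard part, deciding whether two differently worded positions actually contradict each other, is not solved here; this only shows the temporal bookkeeping around that judgment.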

Step 2: Worldview Synthesis

Using all convictions and deep-dive themes across appearances, generate the worldview summary narrative. Re-generated when significant new data arrives (a new episode that introduces new themes or changes positions).

Step 3: Deep-On Identification

Identify topics where the person has deep-dive depth across multiple appearances. Requires at least 2 episodes with depth: deep_dive on overlapping themes. Surface-level mentions across many episodes do not qualify.
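
The qualification rule can be expressed as a count of distinct episodes per topic, restricted to deep_dive themes. In this sketch, "overlapping themes" is simplified to an exact match on a normalized `topic` key, which is an assumption; real theme names will likely need semantic matching first. The cap of 4 badges comes from the person page design.

```typescript
type ThemeDepth = "mentioned" | "discussed" | "deep_dive";

// Hypothetical flattened record: one row per theme per episode.
interface EpisodeTheme {
  episodeId: string;
  topic: string; // normalized topic key (assumed)
  depth: ThemeDepth;
}

function deepOnTopics(
  themes: EpisodeTheme[],
  minEpisodes: number = 2,
  maxBadges: number = 4
): string[] {
  // Collect the distinct episodes in which each topic was a deep dive.
  const episodesByTopic: Record<string, string[]> = {};
  for (const t of themes) {
    if (t.depth !== "deep_dive") continue;
    let episodes = episodesByTopic[t.topic];
    if (episodes === undefined) {
      episodes = [];
      episodesByTopic[t.topic] = episodes;
    }
    if (episodes.indexOf(t.episodeId) === -1) episodes.push(t.episodeId);
  }

  // Keep topics that meet the episode minimum, most-covered first.
  return Object.keys(episodesByTopic)
    .filter((topic) => episodesByTopic[topic].length >= minEpisodes)
    .sort((a, b) => episodesByTopic[b].length - episodesByTopic[a].length)
    .slice(0, maxBadges);
}
```

Topics that appear in many episodes at `mentioned` or `discussed` depth never enter the count, which matches the rule that surface-level mentions do not qualify.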

Step 4: Taste Clustering

Cluster recommendations (books, tools, movies, music) by pattern. Use Claude to identify thematic groupings rather than simple category matching. The clusters should reveal something about how the person thinks.

Step 5: Contextual Connection Card Generation

For each person, generate connection card data for inline display. Query the PersonConnection table for high-scoring connections, then format into the three card types: also-spoke-about, disagrees-on, and recommended-by. Prioritize lesser-known guests when the current person is well-known.
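
One way to prioritize lesser-known guests is to dampen each candidate's connection strength by how often that candidate has appeared. The sketch below is an assumption-heavy illustration: the logarithmic dampening, the threshold for "well-known", the `Candidate` shape, and the function name are all hypothetical, since the roadmap states the goal but not the formula.

```typescript
type CardType = "also_spoke_about" | "disagrees_on" | "recommended_by";

// Hypothetical candidate row, derived from PersonConnection plus Person data.
interface Candidate {
  personSlug: string;
  cardType: CardType;
  strength: number;        // connection strength, 0 to 1
  appearanceCount: number; // how many episodes the candidate has appeared on
}

// Assumed cutoff for treating the current person as well-known.
const WELL_KNOWN_MIN_APPEARANCES = 10;

function pickCards(
  candidates: Candidate[],
  currentPersonAppearances: number,
  limit: number = 3
): Candidate[] {
  const favorLesserKnown = currentPersonAppearances >= WELL_KNOWN_MIN_APPEARANCES;

  // When the page belongs to a well-known person, divide strength by a term
  // that grows with the candidate's appearance count (natural log).
  const score = (c: Candidate): number =>
    favorLesserKnown
      ? c.strength / (1 + Math.log(Math.max(1, c.appearanceCount)))
      : c.strength;

  return [...candidates].sort((a, b) => score(b) - score(a)).slice(0, limit);
}
```

With this weighting, a candidate with strength 0.6 and 2 appearances outranks one with strength 0.8 and 40 appearances on a well-known person's page, while plain strength ordering applies elsewhere.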


6. Frontend Pages

6.1 Page Map

| Route | Page | Description |
| --- | --- | --- |
| / | Home / Explore | Featured people, trending topics, search, category browsing |
| /person/[slug] | Person Profile | Full aggregated profile with inline contextual connections |
| /podcast/[slug] | Podcast Page | Podcast info, all processed episodes, host profile link |
| /explore/[category] | Category Browse | Browse by occupation, interest, hobby, etc. |
| /search | Search | Full-text search across people, topics, quotes |

Routes That Do Not Exist

  • /episode/[id] — No episode pages. Episode data is accessed through person profiles and podcast pages.
  • /person/[slug]/chat — AI chat is deferred to a future phase.
  • /person/[slug]/connections — Connection data is embedded inline on the person profile page.
  • /connections — No standalone connection visualization.

6.2 Podcast Page

The podcast page shows podcast metadata (title, description, artwork, platform links) and a chronological list of all processed episodes with guest names and theme highlights. Each episode entry links to the guest's person profile. This is the closest thing to an episode index — it helps users find episodes, but the content lives on person pages.


7. Implementation Phases

Phase 0: Pipeline Scripts (COMPLETE)

Core extraction pipeline validated end-to-end with real episodes. Five pipeline stages working: transcription, correction, speaker identification, extraction, and entity registry. The prototype also included a static HTML page generator (build-page.ts) for validation purposes; this script is retired and not carried forward into the application.

Phase 1: Foundation

Goal: Move from scripts to a real application with persistent storage.

| Task | Priority | Details |
| --- | --- | --- |
| Next.js app scaffolding | P0 | Next.js 14, TypeScript, Tailwind, shadcn/ui, PostgreSQL with Prisma |
| Database schema | P0 | Core tables: Podcast, Episode, Person, PersonConnection, EntityRegistry |
| Migrate pipeline to app | P0 | Pipeline scripts become src/lib/pipeline/ and src/lib/ai/ modules |
| Manual episode ingestion | P0 | Admin form to submit episode URL, triggers pipeline |
| Basic person list page | P0 | Simple index of all profiled people with name, roles, appearance count |
| Basic search | P1 | Full-text search over Person names and themes |

Phase 2: Person Page MVP

Goal: Build the person page as defined in Section 3. This is the primary focus of the roadmap.

| Task | Priority | Details |
| --- | --- | --- |
| Person page aggregation pipeline | P0 | Implement the 5-step aggregation described in Section 5.2 |
| Worldview summary generation | P0 | Claude-generated narrative of beliefs and recurring themes |
| Conviction extraction & ranking | P0 | Extract positions, rank by strength, detect evolution over time |
| Person profile page (React) | P0 | Full page layout with all sections from Section 3.1 |
| Deep-on badge identification | P0 | Identify and render 2–4 deep-dive topics per person |
| Taste clustering | P1 | Cluster recommendations by pattern using Claude |
| Contextual connection cards | P1 | Inline connection cards: also-spoke-about, disagrees-on, recommended-by |

Phase 3: Discovery & Polish

Goal: Add browsing, category exploration, and frontend polish.

| Task | Priority | Details |
| --- | --- | --- |
| Category browsing pages | P0 | Browse by occupation, interest, hobby, recommended books |
| Home / Explore page | P0 | Featured people, trending topics, recent episodes |
| Podcast page | P1 | Podcast info, all processed episodes, guest profile links |
| Responsive design | P1 | Full mobile responsiveness across all pages |
| Search enhancements | P1 | Search within quotes, filter by topic, date range, person |

Phase 4: Scale & Automation

Goal: Automate ingestion and prepare for growth.

| Task | Priority | Details |
| --- | --- | --- |
| RSS feed auto-ingestion | P0 | Scheduled jobs to check RSS feeds and process new episodes |
| Admin dashboard | P0 | Manage podcasts, monitor pipeline status, review flagged speaker identifications |
| Auth + user accounts | P1 | Registration, saved favorites, custom collections |
| Performance optimization | P1 | Caching, ISR for profile pages, lazy loading, pagination |

8. Key Design Decisions & Constraints

8.1 First-Party Data Principle

Every piece of information on a person's profile must originate from that person's own words during a transcribed podcast interview. No Wikipedia, no LinkedIn, no external APIs. The worldview summary is synthesized from what the person has said. Roles come from how the person introduces themselves.

8.2 Person Page as Research Tool

The person page is comprehensive, opinionated about what matters (conviction-ranked, not alphabetical), and reveals the person's intellectual arc over time. The litmus test: after reading the person page, does the reader feel they understand this person's perspective? If not, the aggregation is too shallow.

8.3 No Episode Pages

Episodes are ingested and processed but do not have their own public-facing route. All extracted content is surfaced through person pages. This keeps the product focused: the value is in the person-level synthesis, not in episode-level summaries. Episode metadata appears in two places: the Podcast Appearances section of person profiles, and the episode list on podcast pages.

8.4 Connections as Inline Discovery

Connections are not a destination — they are a discovery mechanism embedded at the point of interest. The most valuable moment to suggest a lesser-known guest is when the reader is already engaged with a topic that guest spoke deeply about.

8.5 Speaker Attribution Accuracy

Incorrect speaker attribution corrupts the entire dataset. The pipeline is conservative: uncertain identifications are flagged rather than guessed. Episode metadata (title typically names the guest) is the strongest signal.

8.6 Connection Quality Over Quantity

Two people both mentioning "coffee" is not a connection. Two people both recommending the same specific book is. Connection strength is weighted by specificity. Opposing views on the same specific topic score the same as a shared book recommendation (+0.3 each).


9. Project Folder Structure

9.1 Pipeline Scripts (Current — Working)

podgraph/
  scripts/
    transcribe.ts
    correct-transcript.ts
    identify-speakers.ts
    extract.ts
    update-registry.ts
  prompts/
    extraction.txt
    speaker-id.txt
    correct-names.txt
  lib/
    schemas.ts
  data/
    transcript.json
    transcript-corrected.json
    transcript-identified.json
    extraction.json
    corrections-global.json
    entities.json

9.2 Full Application (Target)

podgraph/
  prisma/
    schema.prisma
  src/
    app/
      person/[slug]/page.tsx
      podcast/[slug]/page.tsx
      explore/
      search/page.tsx
      admin/
    components/
      ui/
      person/
      layout/
    lib/
      db.ts
      ai/
        extraction.ts
        summarization.ts
        correction.ts
        aggregation.ts
      pipeline/
        ingestion.ts
        transcription.ts
        speaker-id.ts
        aggregation.ts
        connections.ts
        registry.ts
    workers/
      transcription.worker.ts
      extraction.worker.ts
      aggregation.worker.ts

Key Modules

  • src/lib/ai/aggregation.ts — Worldview synthesis, conviction ranking, taste clustering
  • src/lib/pipeline/connections.ts — Connection scoring engine (powers inline cards)
  • src/workers/aggregation.worker.ts — Background job for person page aggregation after new episode processing

Retired from Prototype

  • scripts/build-page.ts — Static HTML episode page generator. No longer needed.
  • output/ — Generated HTML directory. No longer needed.

Deferred

  • AI chat / RAG system — Theme-aware transcript chunking, vector embeddings, and conversational AI interface are deferred to a future phase. No TranscriptChunk table, no pgvector, no embeddings infrastructure in the current scope.

10. Environment Variables

| Variable | Purpose |
| --- | --- |
| DATABASE_URL | PostgreSQL connection string |
| REDIS_URL | Redis connection for BullMQ job queue |
| ANTHROPIC_API_KEY | Claude API for extraction, correction, summarization, and aggregation |
| ANTHROPIC_EXTRACT_MODEL | Model override for extraction (defaults to claude-sonnet-4-5) |
| DEEPGRAM_API_KEY | Deepgram API for transcription (nova-3) |
| S3_BUCKET / S3_ACCESS_KEY / S3_SECRET_KEY | S3-compatible storage for audio and transcripts |
| NEXTAUTH_SECRET / NEXTAUTH_URL | Authentication configuration |
| YT_DLP_PATH | Optional: explicit path to yt-dlp binary |

11. Implementation Notes for Claude Code

11.1 General Principles

  • TypeScript strictly throughout. No any types.
  • Prisma for all database operations.
  • BullMQ for all background jobs. Workers run in separate processes.
  • Server components by default in Next.js. Client components only when interactivity is needed.
  • All AI prompts stored as template strings in dedicated files under prompts/ for easy iteration.
  • Shared Zod schemas in lib/schemas.ts — never duplicate schema definitions across files.

11.2 Person Page Aggregation Notes

  • Conviction strength scoring must be explainable. Use language intensity, repetition across episodes, and depth of supporting argument as signals. Store the reasoning, not just the score.
  • Evolution detection requires temporal awareness. Compare positions across episodes ordered by publish date. Flag when the same topic appears with a different position.
  • Worldview summary must not read like a Wikipedia intro. Prompt must enforce: start with what makes this person's perspective distinctive, not with their job title.
  • Taste clusters require semantic grouping. Use Claude to identify thematic patterns across recommendations, not simple category matching.
  • Connection cards must prioritize lesser-known guests. When rendering inline connections on a well-known person's page, weight toward guests with fewer appearances. The whole point is discovery.

11.3 Extraction & Correction Notes

  • Theme-based organization: Extract themes as the primary unit, with quotes/opinions/anecdotes nested within them. Aim for 8–10 themes per speaker and do not exceed 10.
  • Quote quality filtering: Skip quotes with repeated words, false starts, or fragments.
  • Entity separation: Output people_mentioned and companies_orgs as distinct arrays.
  • Zod validation with retry: Use safeParse() on Claude's response. If validation fails, send follow-up with Zod error asking for correction.
  • Global corrections file: Append new corrections after each transcript correction run so common errors are fixed before Claude is called on future episodes.

11.4 Testing Strategy

  • Unit tests for extraction parsing and Zod validation.
  • Integration tests for the full pipeline (ingestion → correction → identification → extraction → registry).
  • Person page tests for aggregation pipeline: verify conviction ranking, evolution detection, and taste clustering produce sensible results across different data shapes.
  • Litmus test: After reading the person page, does the reader feel they understand this person's perspective? If not, the aggregation is too shallow.

End of Roadmap — Ready for Implementation