CLAUDE.md — PodGraph

What is this project?

PodGraph is a web app that ingests podcast interviews, transcribes them, and builds AI-generated profiles of people entirely from their own words. It connects people through shared ideas, mutual references, and overlapping interests — surfaced as inline contextual cards on person pages, not as a standalone graph visualization.

The person page is the product. It is a comprehensive research tool where aggregated opinions, detailed anecdotes, cross-episode synthesis, and conviction-ranked positions live. There are no episode pages — episodes are internal data units that feed into person profiles.

A working example of a single-episode guest extraction output is in kevin-rose.html (prototype-era static HTML, now retired in favor of the Next.js person page).

Tech stack

Frontend: Next.js 14+ (App Router), TypeScript, Tailwind CSS, shadcn/ui
Backend API: Next.js API Routes + tRPC (type-safe, co-located with frontend)
Database: PostgreSQL (Supabase or Neon) with Prisma ORM
Job queue: BullMQ + Redis
Transcription: Deepgram API (nova-3 with speaker diarization)
AI extraction & aggregation: Anthropic Claude API (Sonnet 4.5)
Auth: NextAuth.js or Clerk
Deployment: Vercel (frontend) + Railway/Fly.io (workers)

What is NOT in scope

No pgvector / embeddings — RAG chat is deferred to a future phase
No TranscriptChunk table — no vector search infrastructure
No D3.js / vis.js graph viz — connections are inline contextual cards

Project structure

Current: Pipeline Scripts (Working)

podgraph/
├── scripts/
│   ├── pipeline.ts                    # Full pipeline: transcribe → correct → identify → extract → update-registry
│   ├── transcribe.ts                  # Deepgram nova-3 with diarization
│   ├── correct-transcript.ts          # Claude-powered proper noun correction
│   ├── identify-speakers.ts           # Claude maps speaker labels to names
│   ├── extract.ts                     # Structured theme-based extraction (Zod validated)
│   ├── update-registry.ts             # Merge entities into global registry
│   ├── anthropic-cost.ts              # Cost estimation utility
│   ├── youtube-meta.ts                # YouTube metadata & yt-dlp integration
│   ├── test-keys.ts                   # Validate API keys
│   ├── lex/
│   │   ├── pipeline.ts                # Fast path for Lex Fridman transcripts
│   │   └── fetch-transcript.ts        # Scrape lexfridman.com transcripts
│   └── lib/
│       ├── dirs.ts                    # Directory resolution & slug generation
│       ├── manifest.ts                # Episode manifest management
│       └── schemas.ts                 # Zod schemas for extraction output
├── prompts/
│   ├── extraction.txt                 # Theme extraction prompt
│   ├── speaker-id.txt                 # Speaker identification prompt
│   └── correct-names.txt             # Proper noun correction prompt
├── data/
│   ├── episodes/                      # Per-episode data directories
│   ├── entities.json                  # Global entity registry
│   ├── corrections-global.json        # Accumulated name corrections
│   └── manifest.json                  # Processed episodes index
└── package.json

Target: Full Application

podgraph/
├── prisma/schema.prisma
├── src/
│   ├── app/
│   │   ├── person/[slug]/page.tsx     # Person profile (THE primary surface)
│   │   ├── podcast/[slug]/page.tsx    # Podcast info + episode list
│   │   ├── explore/                   # Category browsing
│   │   ├── search/page.tsx            # Full-text search
│   │   └── admin/                     # Episode ingestion, pipeline monitoring
│   ├── components/
│   │   ├── ui/                        # shadcn/ui
│   │   ├── person/                    # Profile sections, connection cards
│   │   └── layout/                    # App shell, navigation
│   ├── lib/
│   │   ├── db.ts                      # Prisma client
│   │   ├── ai/
│   │   │   ├── extraction.ts          # Theme extraction
│   │   │   ├── correction.ts          # Proper noun correction
│   │   │   ├── summarization.ts       # Worldview synthesis
│   │   │   └── aggregation.ts         # Conviction ranking, taste clustering
│   │   └── pipeline/
│   │       ├── ingestion.ts           # Audio download, episode creation
│   │       ├── transcription.ts       # Deepgram integration
│   │       ├── speaker-id.ts          # Speaker identification
│   │       ├── aggregation.ts         # Person page data aggregation
│   │       ├── connections.ts         # Connection scoring engine
│   │       └── registry.ts            # Entity registry management
│   └── workers/
│       ├── transcription.worker.ts
│       ├── extraction.worker.ts
│       └── aggregation.worker.ts
├── scripts/                           # Standalone CLI scripts (kept working)
├── prompts/                           # AI prompt templates
└── data/                              # Local pipeline data

Retired from Prototype

scripts/build-page.ts — Static HTML episode page generator. Replaced by Next.js person page.
output/ — Generated HTML directory. No longer needed.

The cardinal rule: first-party data only

Every piece of information on a Person profile comes exclusively from that person's own words in transcribed podcast interviews. No Wikipedia, no LinkedIn, no external bios. The only exception is the person's name and basic deduplication metadata. This constraint is non-negotiable.

Core data models

Extraction Schema (Validated in Prototype)

Primary unit is the Theme, with quotes/opinions/anecdotes nested within:

Theme: name, depth (mentioned/discussed/deep_dive), description, best_quote, opinion, anecdote, related_people, related_companies
Speaker Data: role, self_description, themes, books, movies_tv, music, games, tools_products, companies_orgs (with slug), people_mentioned (with slug)
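
The two shapes above can be read as the following TypeScript types. This is only a sketch: the real definitions are the Zod schemas in scripts/lib/schemas.ts, and everything the list leaves open (which fields are optional, plain strings for recommendations, the `SluggedRef` name, the `isDepth` helper) is an assumption made for illustration.

```typescript
// Sketch of the extraction output shapes. Optionality and item shapes are
// assumptions; they are not copied from scripts/lib/schemas.ts.
const DEPTHS = ["mentioned", "discussed", "deep_dive"] as const;
type Depth = (typeof DEPTHS)[number];

interface Theme {
  name: string;
  depth: Depth;
  description: string;
  best_quote?: string; // assumed optional
  opinion?: string; // assumed optional
  anecdote?: string; // assumed optional
  related_people: string[];
  related_companies: string[];
}

// Hypothetical shape for the "with slug" fields.
interface SluggedRef {
  name: string;
  slug: string;
}

interface SpeakerData {
  role: string;
  self_description: string;
  themes: Theme[];
  books: string[];
  movies_tv: string[];
  music: string[];
  games: string[];
  tools_products: string[];
  companies_orgs: SluggedRef[];
  people_mentioned: SluggedRef[];
}

// Hypothetical runtime guard for the depth field.
function isDepth(value: unknown): value is Depth {
  return typeof value === "string" && (DEPTHS as readonly string[]).includes(value);
}

// Placeholder value, invented purely to show the nesting.
const exampleSpeaker: SpeakerData = {
  role: "guest",
  self_description: "placeholder self-description",
  themes: [
    {
      name: "placeholder theme",
      depth: "deep_dive",
      description: "placeholder description",
      related_people: [],
      related_companies: [],
    },
  ],
  books: [],
  movies_tv: [],
  music: [],
  games: [],
  tools_products: [],
  companies_orgs: [{ name: "Placeholder Co", slug: "placeholder-co" }],
  people_mentioned: [],
};
```

Themes carry the quotes, opinions, and anecdotes, so anything downstream that needs a quote reaches it through a theme rather than through a flat list.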

Database Entities (Phase 1)

Five main entities: Podcast, Episode (internal, no public route), Person, PersonConnection, EntityRegistry

Key Person fields:

worldview_summary — AI-generated narrative of beliefs and recurring themes (not a biography)
convictions — Ranked positions with conviction strength, evolution timestamps
deep_on_topics — 2-4 most specific deep-dive topics across multiple appearances
taste_clusters — Recommendations clustered by pattern
themes_aggregated — Merged themes across all appearances

Key design decisions:

Episodes have NO public route. All content surfaces through person pages.
PersonConnection powers inline contextual cards, not a graph visualization.
EntityRegistry tracks canonical names, aliases, and cross-episode references.

AI processing pipeline

Episode Pipeline (5 stages)

Each stage is independent and retriable:

Transcription — transcribe.ts: Deepgram nova-3, speaker diarization, utterance grouping
Transcript correction — correct-transcript.ts: Global corrections file + Claude for remaining proper noun errors
Speaker identification — identify-speakers.ts: Claude maps speaker labels to real names with confidence scores
AI extraction — extract.ts: Structured theme-based extraction, Zod validated, chunked for long transcripts
Entity registry update — update-registry.ts: Merge people/companies into global registry with alias detection
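
One way to picture "independent and retriable" is a runner that skips stages already finished and retries a failing stage a few times before stopping. The sketch below is hypothetical: the stage names follow the comment in scripts/pipeline.ts, but `runStages`, the `alreadyDone` set, and the retry count are assumptions, not a description of the actual script.

```typescript
// Hypothetical stage runner; not taken from scripts/pipeline.ts.
type StageName = "transcribe" | "correct" | "identify" | "extract" | "update-registry";

interface Stage {
  name: StageName;
  run: () => Promise<void>;
}

interface RunResult {
  completed: StageName[]; // stages that ran successfully in this invocation
  failed?: StageName; // first stage that exhausted its attempts, if any
}

async function runStages(
  stages: Stage[],
  alreadyDone: ReadonlySet<StageName>,
  maxAttempts = 3, // assumed default
): Promise<RunResult> {
  const completed: StageName[] = [];
  for (const stage of stages) {
    if (alreadyDone.has(stage.name)) continue; // finished in an earlier run
    let ok = false;
    for (let attempt = 1; attempt <= maxAttempts && !ok; attempt++) {
      try {
        await stage.run();
        ok = true;
      } catch {
        // Swallowed here for brevity; a real runner would log the error.
      }
    }
    if (!ok) return { completed, failed: stage.name };
    completed.push(stage.name);
  }
  return { completed };
}
```

Because a failed run reports which stage stopped it, a later invocation can pass the finished stages in `alreadyDone` and resume without repeating transcription or other paid API calls.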

Person Page Aggregation Pipeline (Phase 2)

Runs for each identified speaker whenever a new episode is processed:

Conviction extraction — Extract positions, rank by strength, detect evolution over time
Worldview synthesis — Generate narrative from convictions and deep-dive themes
Deep-on identification — Find topics with deep-dive depth across 2+ appearances
Taste clustering — Cluster recommendations by thematic pattern using Claude
Connection card generation — Format high-scoring connections into inline card types
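
Of the steps above, deep-on identification is the most mechanical, so it makes a convenient illustration. The sketch below assumes a flat list of per-episode theme records (`ThemeAppearance`, `episodeId`, and `findDeepOnTopics` are hypothetical names) and matches topics by lowercased name, which is a simplification; the real pipeline may merge themes differently.

```typescript
// Hypothetical sketch of deep-on identification.
type Depth = "mentioned" | "discussed" | "deep_dive";

interface ThemeAppearance {
  episodeId: string; // assumed identifier for the source episode
  themeName: string;
  depth: Depth;
}

function findDeepOnTopics(
  themes: ThemeAppearance[],
  minAppearances = 2, // "2+ appearances"
  maxTopics = 4, // upper bound of the "2-4" badge range
): string[] {
  // Collect the distinct episodes in which each topic got deep-dive treatment.
  const episodesByTopic = new Map<string, Set<string>>();
  for (const t of themes) {
    if (t.depth !== "deep_dive") continue;
    const key = t.themeName.trim().toLowerCase(); // naive matching (assumption)
    const episodes = episodesByTopic.get(key) ?? new Set<string>();
    episodes.add(t.episodeId);
    episodesByTopic.set(key, episodes);
  }
  // Keep topics seen in enough episodes, most frequent first.
  return Array.from(episodesByTopic.entries())
    .filter(([, episodes]) => episodes.size >= minAppearances)
    .sort((a, b) => b[1].size - a[1].size)
    .slice(0, maxTopics)
    .map(([topic]) => topic);
}
```

Counting distinct episodes rather than raw theme records keeps a topic that appears twice in one long interview from qualifying on its own.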

Connection scoring

Each pair's score is a weighted sum of the signals below, normalized to 0–1. The weights prioritize specificity over surface-level overlap:

Signal	Score
Co-appearance on same episode	+0.5
Mutual mention (both directions)	+0.4
Shared specific book recommendation	+0.3 per book
Opposing positions on same specific topic	+0.3 per topic
Shared movie/music recommendation	+0.2 per item
One-way mention (especially praise of lesser-known)	+0.15
Shared specific topic (not generic)	+0.1 per topic
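
The table can be read as a small scoring function. The weights below are taken from the table; the rest is a sketch under assumptions: the `ConnectionSignals` shape is hypothetical, normalization is done by clamping the raw sum at 1 (the method is not specified here), and a mutual mention is assumed to replace the one-way bonus rather than stack with it.

```typescript
// Hypothetical input shape: counts and flags for one pair of people.
interface ConnectionSignals {
  coAppearance: boolean;
  mutualMention: boolean;
  oneWayMention: boolean;
  sharedBooks: number;
  opposingTopics: number;
  sharedMoviesOrMusic: number;
  sharedSpecificTopics: number;
}

// Weights copied from the signal table.
const WEIGHTS = {
  coAppearance: 0.5,
  mutualMention: 0.4,
  sharedBook: 0.3,
  opposingTopic: 0.3,
  sharedMovieOrMusic: 0.2,
  oneWayMention: 0.15,
  sharedSpecificTopic: 0.1,
} as const;

function scoreConnection(s: ConnectionSignals): number {
  const raw =
    (s.coAppearance ? WEIGHTS.coAppearance : 0) +
    (s.mutualMention ? WEIGHTS.mutualMention : 0) +
    // Assumption: the one-way bonus applies only when the mention is not mutual.
    (!s.mutualMention && s.oneWayMention ? WEIGHTS.oneWayMention : 0) +
    s.sharedBooks * WEIGHTS.sharedBook +
    s.opposingTopics * WEIGHTS.opposingTopic +
    s.sharedMoviesOrMusic * WEIGHTS.sharedMovieOrMusic +
    s.sharedSpecificTopics * WEIGHTS.sharedSpecificTopic;
  // Assumption: normalize by clamping to the 0-1 range.
  return Math.min(1, Math.max(0, raw));
}
```

Under these assumptions, two shared books alone (0.6) outrank a co-appearance alone (0.5), which matches the stated preference for specific overlap.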

"Both mention coffee" is not a connection. "Both recommend the same specific book" is.

Person page design

The person page sections (see roadmap Section 3 for full detail):

Header — Name, aggregated self-described roles, appearance count
Worldview Summary — 2-3 paragraph narrative (NOT a biography, NOT "X is a...")
Convictions & Positions — Ranked by conviction strength, with evolution tracking
Contextual Connections (Inline) — "Also spoke about", "Disagrees on", "Recommended by"
Taste Profile — Recommendations clustered by pattern, not flat lists
"Deep On" Badges — 2-4 most specific deep-dive topics near header
People They Mention — Aggregated with context and sentiment
Podcast Appearances — Chronological list with 2-3 strongest themes per appearance

What is NOT on the person page

No network graph visualization
No external biographical data
No episode pages (episodes are data sources, not destinations)
No AI chat (deferred)

Frontend pages

Route	Page
`/`	Home / Explore — Featured people, trending topics, search
`/person/[slug]`	Person Profile — Full aggregated profile with inline connections
`/podcast/[slug]`	Podcast Page — Info, processed episodes, host profile link
`/explore/[category]`	Category Browse — By occupation, interest, hobby
`/search`	Search — Full-text across people, topics, quotes

Routes that do NOT exist: /episode/[id], /person/[slug]/chat, /connections

Implementation phases

Phase 0: Pipeline Scripts — COMPLETE

Core extraction pipeline validated. Five stages working. Static HTML generator (build-page.ts) retired.

Phase 1: Foundation

Next.js scaffolding, database schema (Prisma), migrate pipeline to app modules, manual episode ingestion admin form, basic person list page, basic search.

Phase 2: Person Page MVP

Person page aggregation pipeline, worldview summary, conviction extraction & ranking, full profile page with all sections, deep-on badges, taste clustering, contextual connection cards.

Phase 3: Discovery & Polish

Category browsing, home/explore page, podcast page, responsive design, search enhancements.

Phase 4: Scale & Automation

RSS auto-ingestion, admin dashboard, auth + user accounts, performance optimization.

Code conventions

Strict TypeScript throughout. No `any` types.
Prisma for all database operations.
BullMQ for all background jobs. Workers run in separate processes.
Server components by default in Next.js. Client components only when interactivity is needed.
All AI prompts stored as template strings in dedicated files under prompts/.
Shared Zod schemas in scripts/lib/schemas.ts (later src/lib/schemas.ts) — never duplicate schema definitions.
AI extraction prompts request structured JSON. Validate with Zod at runtime. Retry with schema feedback on failure.
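
The last convention, validate then retry with schema feedback, might look roughly like the loop below. It is a dependency-free sketch: `callModel` stands in for whatever wraps the Claude API, `validate` stands in for a Zod check (for example a wrapper around `schema.safeParse`), and the function names, feedback wording, and attempt limit are all assumptions.

```typescript
// Result of validating one model response.
type Validation<T> = { ok: true; value: T } | { ok: false; errors: string[] };

// Parse JSON without throwing, so bad JSON is handled like any other failure.
function parseJson(raw: string): Validation<unknown> {
  try {
    return { ok: true, value: JSON.parse(raw) };
  } catch {
    return { ok: false, errors: ["response was not valid JSON"] };
  }
}

async function extractWithRetry<T>(
  basePrompt: string,
  callModel: (prompt: string) => Promise<string>, // hypothetical model wrapper
  validate: (parsed: unknown) => Validation<T>, // e.g. backed by a Zod schema
  maxAttempts = 3, // assumed default
): Promise<T> {
  let prompt = basePrompt;
  let lastErrors: string[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await callModel(prompt);
    const parsed = parseJson(raw);
    const result: Validation<T> = parsed.ok ? validate(parsed.value) : parsed;
    if (result.ok) return result.value;
    lastErrors = result.errors;
    // Feed the validation errors back so the next attempt can correct them.
    prompt =
      `${basePrompt}\n\nYour previous response failed validation:\n- ` +
      `${lastErrors.join("\n- ")}\nReturn corrected JSON only.`;
  }
  throw new Error(`Extraction failed after ${maxAttempts} attempts: ${lastErrors.join("; ")}`);
}
```

Keeping the validator as a parameter lets the same loop serve every prompt while the schema definitions stay in one shared file.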

Environment variables

See .env.example. Currently required: DEEPGRAM_API_KEY, ANTHROPIC_API_KEY. Phase 1 will add DATABASE_URL and REDIS_URL, plus NEXTAUTH_SECRET and NEXTAUTH_URL if NextAuth.js is the chosen auth provider.

Reference files

podgraph-roadmap-revised.md — Full architecture and implementation guide (the source of truth)
kevin-rose.html — Example of prototype-era static guest extraction output
TODO.md — Current task priorities

TODO list

Check off items in TODO.md as they are completed.

Case study (CASE_STUDY.md)

Maintain a portfolio case study about PodGraph at CASE_STUDY.md. Update it as the project evolves.

Structure:

Problem — Why podcast content is hard to search/reference, and what I wanted to build
Architecture decisions — Deepgram Nova-3 for transcription with speaker diarization, Claude API for structured extraction, the data pipeline design
Tradeoffs — What I considered and rejected, why I chose these tools over alternatives
Challenges — What was harder than expected and how I worked through it
Current state / what's next

Tone: Write like explaining this to a senior engineer over coffee. No tutorial voice, no fluff. Be specific about technical choices and reasoning. Keep it to ~800 words.
