feat: Add video/podcast transcription to fetch phase by StartupBros · Pull Request #30 · alexknowshtml/smaug

StartupBros · 2026-03-25T23:37:15Z

Summary

Add tiered transcript extraction to the fetch phase for video and podcast bookmarks:

1. yt-dlp captions: downloads existing subtitles, no audio processing needed
2. Whisper fallback: extracts audio and transcribes locally when no captions exist
3. Graceful placeholder: sets status: needs_transcript when the tools aren't installed

Both yt-dlp and Whisper are optional, so this PR adds zero new required dependencies.
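
The tier order above can be illustrated with a minimal sketch. This is not the PR's code: getTranscript, the tier objects, and their available/run fields are hypothetical names, and the stubs stand in for the real yt-dlp and Whisper calls. The sketch also assumes the placeholder is used whenever no tier yields text, not only when a tool is missing.

```javascript
// Hypothetical sketch of a tiered transcript lookup (not the PR's actual API).
// Each tier reports whether its tool is available and returns text or null.
function getTranscript(url, tiers) {
  for (const tier of tiers) {
    if (!tier.available()) continue; // tool not installed: skip this tier
    const text = tier.run(url);
    if (text) return { status: "ok", source: tier.name, text };
  }
  // No tier produced text: leave a placeholder for later processing.
  return { status: "needs_transcript", source: null, text: null };
}

// Stub tiers standing in for yt-dlp captions and Whisper (assumptions).
const captions = { name: "captions", available: () => true, run: () => null };
const whisper = {
  name: "whisper",
  available: () => true,
  run: (url) => `transcript of ${url}`,
};
const notInstalled = {
  name: "whisper",
  available: () => false,
  run: () => {
    throw new Error("should never run");
  },
};

console.log(getTranscript("https://example.com/v", [captions, whisper]).source); // whisper
console.log(getTranscript("https://example.com/v", [notInstalled]).status); // needs_transcript
```

Keeping each tier behind the same small interface means a missing tool only removes one step from the chain instead of failing the whole fetch.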

Architecture

Full transcripts are stored as separate files (.state/transcripts/{id}.txt) instead of being inlined in the pending JSON. Stored this way, a 215K-character transcript (a 1-hour talk) takes only 89 bytes in the JSON. During processing, Claude reads only the first ~20K characters for summarization. Transcript files are cleaned up after processing.
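
A rough sketch of this storage pattern, using only Node's built-in fs, os, and path modules. The function names (storeTranscript, readPreview, cleanupTranscript) and the reference fields (transcriptFile, transcriptChars) are illustrative assumptions, not the PR's actual exports or JSON schema, and a temp directory stands in for .state/transcripts/.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

const PREVIEW_CHARS = 20000; // assumption: stands in for the "~20K characters" budget

// Write the full transcript to {dir}/{id}.txt and return a small reference
// object that can be stored in the pending JSON instead of the text itself.
function storeTranscript(dir, id, text) {
  fs.mkdirSync(dir, { recursive: true });
  const file = path.join(dir, `${id}.txt`);
  fs.writeFileSync(file, text, "utf8");
  return { transcriptFile: file, transcriptChars: text.length };
}

// Load the file and keep only the leading slice used for summarization.
function readPreview(ref, maxChars = PREVIEW_CHARS) {
  return fs.readFileSync(ref.transcriptFile, "utf8").slice(0, maxChars);
}

// Delete the transcript file once the bookmark has been processed.
function cleanupTranscript(ref) {
  fs.rmSync(ref.transcriptFile, { force: true });
}

// Usage: a temp directory stands in for .state/transcripts/.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "transcripts-"));
const ref = storeTranscript(dir, "12345", "word ".repeat(43000)); // 215,000 chars
console.log(JSON.stringify(ref).length); // size of the JSON entry, far below 215K
console.log(readPreview(ref).length); // 20000
cleanupTranscript(ref);
```

The pending JSON then grows with the number of bookmarks rather than with transcript length.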

Changes

src/processor.js: 6 new exports (including findYtDlp, findWhisper, parseJson3Transcript, parseVttTranscript, fetchTranscriptContent), plus a podcast URL classification fix and transcript file storage
src/config.js: new config keys ytdlpPath, whisperPath, whisperModel, transcribeTimeouts
.claude/commands/process-bookmarks.md: the transcribe action reads transcript files and creates rich knowledge files with key takeaways
README.md: transcription docs, install instructions, config reference
test/: 23 new tests, 2 new fixtures
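
To illustrate the kind of work a subtitle parser such as parseVttTranscript has to do, here is a hedged sketch of flattening WebVTT into plain text. It is not the PR's implementation: vttToText is a hypothetical name, and the handling of inline tags and repeated lines is an assumption about what auto-generated captions typically need.

```javascript
// Hypothetical sketch: flatten WebVTT captions into one plain-text string.
// Blocks are separated by blank lines; only blocks with a timing line
// ("start --> end") are cues. Header, NOTE, and STYLE blocks are skipped.
function vttToText(vtt) {
  const lines = [];
  const blocks = vtt.replace(/\r\n?/g, "\n").split(/\n\s*\n/);
  for (const block of blocks) {
    const rows = block.split("\n");
    const timing = rows.findIndex((row) => row.includes("-->"));
    if (timing === -1) continue; // not a cue block
    for (const row of rows.slice(timing + 1)) {
      // Drop inline tags such as <c> or <00:00:01.000>.
      const text = row.replace(/<[^>]+>/g, "").trim();
      // Rolling captions often repeat the previous line; keep one copy.
      if (text && text !== lines[lines.length - 1]) lines.push(text);
    }
  }
  return lines.join(" ");
}

const sample = [
  "WEBVTT",
  "Kind: captions",
  "",
  "00:00:00.000 --> 00:00:02.000",
  "Hello <c>world</c>",
  "",
  "00:00:02.000 --> 00:00:04.000",
  "Hello world",
  "this is a test",
].join("\n");

console.log(vttToText(sample)); // Hello world this is a test
```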

Testing

76 tests pass, 0 failures. Covers tool detection, subtitle parsing (json3 + VTT), transcript extraction, podcast URL classification, and config defaults/overrides. Integration verified with YouTube, Vimeo, and SoundCloud content.

Add tiered transcript extraction to the fetch phase:

1. yt-dlp captions (fast, no audio processing)
2. Whisper audio transcription (fallback when no captions)
3. Graceful placeholder (when tools not installed)

Full transcripts are stored as separate files in .state/transcripts/ to keep the pending JSON small: a 215K char transcript takes 89 bytes in JSON. Processing reads only the first 20K chars needed for summarization.

New exports: findYtDlp, findWhisper, parseJson3Transcript, parseVttTranscript, fetchTranscriptContent

Also fixes podcast URL classification (was falling through to article type, fetching useless SPA HTML).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

StartupBros force-pushed the feat/video-podcast-transcription branch from 84cc82e to 291d70f Compare March 26, 2026 00:48

StartupBros wants to merge 1 commit into alexknowshtml:main from StartupBros:feat/video-podcast-transcription
