Generate SRT subtitles from audio/video with AI transcription, then refine, edit, and export — all in one place.
- AI Transcription: OpenAI Whisper API / Google Gemini API
- Multi-format: MP4, MP3, WAV, MOV, AVI, MKV, M4A, FLAC, OGG, WebM
- LLM Post-processing: 3 refine modes (Verbatim / Standard / Caption) with glossary
- Verify Pass: Full-text consistency check for proper nouns and kanji
- Ollama Support: Use local LLM models (Qwen3, etc.) for post-processing
- Inline Editing: Edit text, timestamps, and segment structure in-browser
- Speaker Management: Register speakers, assign per segment, auto-coloring (8 colors)
- Segment Operations: Merge, delete, add segments with time overlap validation
- Time Controls: Editable timestamps with ±0.1s nudge buttons
- Audio Playback: Integrated player bar, click any segment to play that range
- AI Suggestions: Per-segment AI corrections using glossary (supports Ollama)
- Speaker-filtered Export: Download SRT/VTT for a specific speaker only
- Metadata: SEO-optimized title, description with chapters, 15-25 tags
- Tone Reference: Match previous videos' writing style
- Catchphrases: 5 thumbnail text suggestions with style classification
- Quiz: 5 multiple-choice questions from video content
- Per-task model choice: Select provider + model when generating (Upload, History, Meta Editor)
- Settings defaults: Configure default models per provider, with optional refine model override
- Ollama integration: Dropdown populated from local Ollama instance, auto-resolves Docker networking
- Upload History: Grouped actions, status indicators, inline modal preview
- Cost Dashboard: Track API costs by provider, model, month, and operation
- Settings: Encrypted API keys, model presets, glossary, refine prompts, pricing config
- i18n: English / Japanese
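The SRT output at the heart of these features follows a simple numbered-cue format. A minimal writer can be sketched as follows (function names are illustrative, not the app's actual API):

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as an SRT timestamp, e.g. 3661.5 -> '01:01:01,500'."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(segments: list[dict]) -> str:
    """Serialize [{'start': float, 'end': float, 'text': str}, ...] to SRT."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{format_timestamp(seg['start'])} --> "
            f"{format_timestamp(seg['end'])}\n{seg['text']}\n"
        )
    return "\n".join(blocks)

print(write_srt([{"start": 0.0, "end": 2.5, "text": "Hello"}]))
# 1
# 00:00:00,000 --> 00:00:02,500
# Hello
```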
git clone https://github.com/JFK/voicesrt.git
cd voicesrt
cp .env.example .env
# Set ENCRYPTION_KEY in .env
docker compose up --build
# Open http://localhost:8000 → Settings → Configure API keys

# Start Ollama on the host first: ollama serve
# Pull a model: ollama pull qwen3:8b
docker compose up --build
# Settings → Ollama Base URL: http://localhost:11434
# (Auto-resolved to host.docker.internal inside the container)

- Install Docker Desktop with the WSL 2 engine enabled
- Settings → Resources → WSL Integration → Enable for your distro
- In WSL terminal:
git clone https://github.com/JFK/voicesrt.git
cd voicesrt
cp .env.example .env
# Set ENCRYPTION_KEY in .env
docker compose up --build

# Prerequisites: Python 3.11+, ffmpeg
pip install -e ".[dev]"
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
cp .env.example .env
# Set ENCRYPTION_KEY in .env
uvicorn src.main:app --reload --port 8000

Settings → Enter OpenAI / Google API keys. For Ollama, set the base URL and select a model.
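The Ollama base-URL auto-resolution mentioned above can be sketched roughly like this (a simplified illustration under the assumption that `/.dockerenv` signals a containerized run; the app's actual logic may differ):

```python
import os
from urllib.parse import urlparse, urlunparse

def resolve_ollama_url(base_url: str,
                       in_docker: bool = os.path.exists("/.dockerenv")) -> str:
    """Inside a Docker container, 'localhost' refers to the container itself,
    so rewrite it to host.docker.internal to reach Ollama on the host."""
    parsed = urlparse(base_url)
    if in_docker and parsed.hostname in ("localhost", "127.0.0.1"):
        port = f":{parsed.port}" if parsed.port else ""
        parsed = parsed._replace(netloc=f"host.docker.internal{port}")
    return urlunparse(parsed)

resolve_ollama_url("http://localhost:11434", in_docker=True)
# -> 'http://host.docker.internal:11434'
```

On the host (outside Docker) the URL passes through unchanged, so the same setting works in both environments.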
Upload → Drag & drop a file → Select transcription engine (Whisper / Gemini) → Choose post-processing model if refine is enabled → Upload & Process.
History → Edit → SRT Editor. Edit segments, assign speakers, use AI suggestions, merge/split segments, adjust timestamps. Download full SRT/VTT or per-speaker exports.
History → Meta → Metadata Editor. Set channel info, choose LLM model, generate title/description/chapters/tags.
History → Catchphrase / Quiz → Select model in modal → Generate.
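The merge and overlap-validation operations in the SRT Editor can be sketched as follows (a hedged sketch with hypothetical helper names, not the app's internal code):

```python
def check_overlap(segments: list[dict]) -> list[tuple[int, int]]:
    """Return index pairs of time-adjacent segments whose ranges overlap."""
    ordered = sorted(range(len(segments)), key=lambda i: segments[i]["start"])
    return [
        (a, b)
        for a, b in zip(ordered, ordered[1:])
        if segments[a]["end"] > segments[b]["start"]
    ]

def merge_segments(a: dict, b: dict) -> dict:
    """Combine two segments: union of the time range, concatenated text."""
    return {
        "start": min(a["start"], b["start"]),
        "end": max(a["end"], b["end"]),
        "text": f"{a['text']} {b['text']}".strip(),
    }
```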
- Backend: FastAPI (Python 3.11+), async/await
- Frontend: Jinja2 + HTMX + Alpine.js + Tailwind CSS (no build step)
- Database: SQLite (SQLAlchemy 2.0 async + aiosqlite + Alembic)
- AI: OpenAI Whisper/GPT, Google Gemini, Ollama (local)
- Audio: ffmpeg
- Security: Fernet encryption for API keys
- i18n: English / Japanese
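API keys are protected with Fernet symmetric encryption from the `cryptography` package; the pattern is roughly the following (a sketch of the general recipe, not the app's storage code — in the app the key comes from `ENCRYPTION_KEY` in `.env`):

```python
from cryptography.fernet import Fernet

# In the app this key is read from ENCRYPTION_KEY in .env,
# generated once with Fernet.generate_key().
key = Fernet.generate_key()
cipher = Fernet(key)

token = cipher.encrypt(b"sk-your-api-key")  # ciphertext, stored in SQLite
plain = cipher.decrypt(token)               # recovered when calling the API
```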
| | OpenAI Whisper | Google Gemini | Ollama (Local) |
|---|---|---|---|
| Transcription | Yes (dedicated ASR) | Yes (multimodal LLM) | No (uses Whisper) |
| Post-processing | GPT models | Gemini models | Any local model |
| Cost | $0.006/min (STT) + LLM | ~$0.0005/min (Flash Lite) | Free (local hardware) |
| File Limit | 25MB (auto-chunking) | 9.5 hours | N/A |
| Privacy | Cloud | Cloud | Fully local |
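The per-minute rates above translate into per-video costs; for example, at Whisper's $0.006/min, a one-hour video costs about $0.36 to transcribe (illustrative arithmetic only, excluding any LLM post-processing):

```python
def whisper_cost(duration_min: float, rate_per_min: float = 0.006) -> float:
    """Transcription cost in USD at the listed $0.006/min Whisper rate."""
    return round(duration_min * rate_per_min, 4)

whisper_cost(60)  # one hour -> 0.36 USD
```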
MIT License


