AI-powered audio transcription, voice cloning, and image generation — all in one service
Transcodio is a production-ready platform powered by NVIDIA's Parakeet TDT 0.6B v3 model, deployed on Modal's serverless GPU infrastructure. It combines real-time streaming transcription, speaker diarization, AI meeting minutes, voice cloning with saved profiles, and text-to-image generation into a unified web application.
- Real-time Streaming Transcription: Progressive results via SSE with silence-based segmentation using NVIDIA Parakeet TDT 0.6B v3
- Speaker Diarization: Automatic speaker identification using NVIDIA TitaNet embeddings + AgglomerativeClustering
- Meeting Minutes: AI-powered summaries with action items using Anthropic Claude Haiku 4.5
- Voice Cloning: Clone any voice with Qwen3-TTS — upload or record reference audio (up to 5 minutes), then synthesize up to 50,000 characters of text
- Saved Voice Profiles: Persistently store voice profiles in Modal Volume for reuse without re-uploading
- Image Generation: Text-to-image using FLUX.1-schnell with 4-step inference (~3-5 seconds per image)
- Audio Playback: Integrated player to listen to uploaded audio alongside transcription
- Multiple Formats: Supports MP3, WAV, M4A, FLAC, OGG, WebM, MP4
- Subtitle Export: Download transcriptions as SRT/VTT with speaker labels
- Cost Effective: ~$0.006 per minute of audio transcription on NVIDIA L4 GPUs
- 10 Languages for TTS: Spanish, English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Italian
- Multilingual UI: English (default) and Spanish with one-click language toggle
- Security Hardened: API key authentication, per-IP rate limiting, CSP headers, path traversal protection, XSS prevention
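For instance, the subtitle export with speaker labels mentioned above might look like this (a hypothetical sample; the `[Speaker N]` label format is an assumption, not taken from the project's actual output):

```
1
00:00:00,000 --> 00:00:03,500
[Speaker 1] Thanks everyone for joining today.

2
00:00:03,500 --> 00:00:07,200
[Speaker 2] Let's start with the roadmap review.
```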
- Python 3.12+
- uv — Fast Python package installer and runner
- Modal account (free tier available)
- FFmpeg installed locally
- Anthropic API key (for meeting minutes)
- HuggingFace token (for image generation — FLUX.1-schnell is a gated model)
- Clone the repository:

```bash
git clone https://github.com/dorianlgs/transcodio.git
cd transcodio
```

- Install dependencies:

```bash
uv sync
```

- Set up Modal authentication:

```bash
py -m modal setup
```

- Create Modal secrets:

```bash
# Anthropic API key (required for meeting minutes)
py -m modal secret create anthropic-api-key ANTHROPIC_API_KEY=sk-ant-...

# HuggingFace token (required for image generation)
py -m modal secret create hf-token HF_TOKEN=hf_...
```

- (Optional) Set an API key for production:

```bash
# Require authentication on all /api/* endpoints
# When unset, authentication is disabled (development mode)
export TRANSCODIO_API_KEY=your-secret-api-key
```

- Deploy the Modal backend:

```bash
py -m modal deploy modal_app/app.py
```

This deploys 6 Modal classes:
- ParakeetSTTModel — GPU transcription (streaming + non-streaming)
- SpeakerDiarizerModel — Speaker identification with TitaNet
- MeetingMinutesGenerator — Claude Haiku 4.5 meeting summaries
- Qwen3TTSVoiceCloner — Voice cloning and synthesis
- VoiceStorage — Persistent voice profile management
- FluxImageGenerator — Text-to-image generation
- Start the FastAPI server:
```bash
uv run uvicorn api.main:app --reload
```

- Open http://localhost:8000
The UI has three modes, with English as the default language. Click the ES/EN button in the header to switch between English and Spanish (preference is saved across sessions).
Transcription
- Drag and drop an audio file or click to browse
- Toggle optional features: Speaker Diarization, Meeting Minutes
- Watch real-time transcription results appear segment by segment
- View speaker-labeled segments and meeting minutes tabs
- Copy text, download transcription (TXT), or export subtitles (SRT/VTT)
- Listen to original audio with the integrated player
Voice Cloning
- Choose a saved voice or create a new one
- For new voices: upload reference audio (3s–5min) or record with your microphone
- Enter the reference transcription and select the language
- Enter target text to synthesize (up to 50,000 characters)
- Click Generate — listen to and download the result
- Optionally save the voice profile for reuse
Image Generation
- Enter a text prompt (up to 500 characters)
- Select dimensions (512x512, 768x768, or 1024x1024)
- Click Generate — preview and download the image
```bash
# Basic transcription
uv run transcribe_file.py audio.mp3

# Save to file
uv run transcribe_file.py audio.mp3 -o transcript.txt

# Non-streaming mode
uv run transcribe_file.py audio.mp3 --no-stream

# Process multiple files
uv run transcribe_file.py *.mp3

# All options
uv run transcribe_file.py --help
```

Authentication: If TRANSCODIO_API_KEY is set, add -H "X-API-Key: YOUR_KEY" to all requests.

Rate Limits (per IP): transcription 5/min, voice clone and image generation 10/min, all other endpoints 30/min.
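In Python, the API-key header and a retry-after-429 policy can be handled with a small sketch like this (the endpoint behavior and env var name come from this README; the helper names and the exponential backoff schedule are assumptions):

```python
import os

def api_headers() -> dict[str, str]:
    """Headers for /api/* calls: send X-API-Key only when a key is configured.
    When TRANSCODIO_API_KEY is unset, the server runs without auth (dev mode)."""
    key = os.getenv("TRANSCODIO_API_KEY", "")
    return {"X-API-Key": key} if key else {}

def backoff_delays(retries: int = 3, base: float = 2.0) -> list[float]:
    """Delays (seconds) to sleep between retries after a 429 response: 2, 4, 8..."""
    return [base * 2 ** i for i in range(retries)]

# e.g. requests.post(url, files={"file": f}, headers=api_headers())
```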
POST /api/transcribe — Complete transcription (non-streaming)
curl -X POST "http://localhost:8000/api/transcribe" -F "file=@audio.mp3"{
"text": "Complete transcription...",
"language": "en",
"duration": 45.2,
"segments": [
{ "id": 0, "start": 0.0, "end": 3.5, "text": "First segment..." }
]
}POST /api/transcribe/stream — Streaming transcription (SSE)
```bash
curl -X POST "http://localhost:8000/api/transcribe/stream" \
  -F "file=@audio.mp3" \
  -F "enable_diarization=true" \
  -F "enable_minutes=true"
```

SSE events: metadata → progress (per segment) → speakers_ready → minutes_ready → complete
GET /api/audio/{session_id} — Retrieve cached audio for playback
POST /api/voice-clone — Clone a voice and synthesize text (single-shot)
```bash
curl -X POST "http://localhost:8000/api/voice-clone" \
  -F "ref_audio=@reference.wav" \
  -F "ref_text=Hello, this is my voice." \
  -F "target_text=Text to synthesize with the cloned voice." \
  -F "language=en"
```

GET /api/voices — List all saved voice profiles
POST /api/voices — Save a new voice profile
```bash
curl -X POST "http://localhost:8000/api/voices" \
  -F "name=My Voice" \
  -F "ref_audio=@reference.wav" \
  -F "ref_text=Reference transcription" \
  -F "language=en"
```

DELETE /api/voices/{voice_id} — Delete a saved voice
POST /api/synthesize — Synthesize text with a saved voice
```bash
curl -X POST "http://localhost:8000/api/synthesize" \
  -F "voice_id=abc123" \
  -F "target_text=Text to synthesize."
```

POST /api/generate-image — Generate image from text prompt
```bash
curl -X POST "http://localhost:8000/api/generate-image" \
  -F "prompt=a beautiful sunset over mountains" \
  -F "width=768" \
  -F "height=768"
```

GET /api/image/{session_id} — Retrieve generated image as PNG
GET /health — Health check
```
┌─────────────────────┐
│       Web UI        │
│    (HTML/JS/CSS)    │
│      3 modes:       │
│    Transcription    │
│    Voice Cloning    │
│   Image Generation  │
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│       FastAPI       │
│    API Key Auth     │
│    Rate Limiting    │
│  Security Headers   │
│  Upload/Validation  │
│    SSE Streaming    │
│   Session Caching   │
└─────────┬───────────┘
          │
          ▼
┌───────────────────────────────────────────────────┐
│              Modal Serverless GPUs                │
│                                                   │
│ ┌──────────────┐  ┌───────────────────────────┐   │
│ │ Parakeet STT │  │ TitaNet Diarization (GPU) │   │
│ │   (L4 GPU)   │  └───────────────────────────┘   │
│ └──────────────┘                                  │
│ ┌──────────────┐  ┌───────────────────────────┐   │
│ │  Qwen3-TTS   │  │  FLUX.1-schnell (L4 GPU)  │   │
│ │ Voice Clone  │  │     Image Generation      │   │
│ │   (L4 GPU)   │  └───────────────────────────┘   │
│ └──────────────┘                                  │
│ ┌──────────────┐  ┌───────────────────────────┐   │
│ │ Claude Haiku │  │       Voice Storage       │   │
│ │ Minutes (CPU)│  │      (Modal Volume)       │   │
│ └──────────────┘  └───────────────────────────┘   │
└───────────────────────────────────────────────────┘
```
Transcodio includes multiple layers of security hardening:
| Layer | Protection | Details |
|---|---|---|
| Authentication | API key via X-API-Key header | Set TRANSCODIO_API_KEY env var; disabled when empty (dev mode) |
| Rate Limiting | Per-IP limits via slowapi | 5/min transcription, 10/min voice-clone & image, 30/min other |
| CSP | Content Security Policy | default-src 'self'; blocks inline scripts, external resources |
| Clickjacking | X-Frame-Options: DENY | Prevents embedding in iframes |
| MIME Sniffing | X-Content-Type-Options: nosniff | Forces declared content type |
| Path Traversal | UUID validation + path resolution | Blocks ../../ attacks on voice storage and cached files |
| XSS | DOM-safe rendering | Server data rendered via textContent/escapeHtml(), not raw innerHTML |
| Header Injection | Filename sanitization | Strips control chars; maps to safe MIME types |
| Error Hardening | Generic error messages | Internal details never exposed to clients |
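As an illustration of the path-traversal row above, the combination of UUID validation and path resolution can be sketched like this (the function name and storage path are assumptions, not the project's actual code):

```python
import uuid
from pathlib import Path

VOICES_DIR = Path("/data/voices")  # assumed storage root for this sketch

def resolve_voice_path(voice_id: str) -> Path:
    """Reject anything that isn't a UUID, then confine the resolved
    path to the storage root so ../../ sequences can never escape it."""
    uuid.UUID(voice_id)  # raises ValueError on "../../etc/passwd" etc.
    path = (VOICES_DIR / f"{voice_id}.wav").resolve()
    if not path.is_relative_to(VOICES_DIR.resolve()):
        raise ValueError("path escapes storage root")
    return path
```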
Modal-specific settings are embedded in modal_app/app.py. API-layer settings are in config.py.
Key parameters:
```python
# Transcription
STT_MODEL_ID = "nvidia/parakeet-tdt-0.6b-v3"
SAMPLE_RATE = 16000
MODAL_GPU_TYPE = "L4"

# Silence Detection (streaming segmentation)
SILENCE_THRESHOLD_DB = -40
SILENCE_MIN_LENGTH_MS = 700

# Speaker Diarization
ENABLE_SPEAKER_DIARIZATION = True
DIARIZATION_MAX_SPEAKERS = 5

# Meeting Minutes
ENABLE_MEETING_MINUTES = True
ANTHROPIC_MODEL_ID = "claude-haiku-4-5-20251001"

# Voice Cloning
VOICE_CLONE_MAX_REF_DURATION = 300  # 5 minutes
VOICE_CLONE_MAX_TARGET_TEXT = 50000  # characters
MAX_SAVED_VOICES = 50

# Image Generation
ENABLE_IMAGE_GENERATION = True
IMAGE_NUM_INFERENCE_STEPS = 4

# File Limits
MAX_FILE_SIZE_MB = 100
MAX_DURATION_SECONDS = 3600  # 60 minutes

# Performance
MODAL_CONTAINER_IDLE_TIMEOUT = 120
ENABLE_GPU_MEMORY_SNAPSHOT = True  # 85-90% faster cold starts

# Security
API_KEY = os.getenv("TRANSCODIO_API_KEY", "")  # Empty = no auth (dev)
RATE_LIMIT_TRANSCRIBE = "5/minute"
RATE_LIMIT_VOICE_CLONE = "10/minute"
RATE_LIMIT_IMAGE = "10/minute"
RATE_LIMIT_DEFAULT = "30/minute"
```

- Audio: MP3, WAV, M4A, FLAC, OGG, WebM
- Video: MP4 (audio track extracted)
- Max file size: 100MB (transcription), 15MB (voice reference)
- Max duration: 60 minutes (transcription), 5 minutes (voice reference)
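A client-side pre-check against the limits above might look like this (the limits are copied from this README; the helper itself is hypothetical and not part of the project):

```python
AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm", ".mp4"}

def check_upload(filename: str, size_bytes: int, duration_s: float,
                 voice_ref: bool = False) -> list[str]:
    """Return a list of limit violations (empty list means the upload is OK).
    voice_ref=True applies the stricter voice-reference limits (15MB, 5 min)."""
    problems = []
    ext = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in AUDIO_EXTS:
        problems.append(f"unsupported format: {ext or filename}")
    max_mb = 15 if voice_ref else 100
    max_s = 300 if voice_ref else 3600
    if size_bytes > max_mb * 1024 * 1024:
        problems.append(f"file exceeds {max_mb}MB")
    if duration_s > max_s:
        problems.append(f"duration exceeds {max_s}s")
    return problems
```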
| Feature | Cost |
|---|---|
| Transcription | ~$0.006 per minute of audio |
| Speaker Diarization | Included (same GPU) |
| Meeting Minutes | ~$0.001 per request (Haiku API) |
| Voice Cloning | ~$0.01-0.02 per synthesis |
| Image Generation | ~$0.01-0.02 per image |
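As rough arithmetic on the table above (these are estimates; the voice and image figures are ranges, so the 0.015 midpoint below is an assumption):

```python
# Approximate per-feature costs from the table above (USD)
COSTS = {
    "transcription_per_min": 0.006,
    "minutes_request": 0.001,
    "image": 0.015,  # midpoint of the stated $0.01-0.02 range
}

def session_cost(audio_minutes: float, with_minutes: bool = True,
                 images: int = 0) -> float:
    """Estimate the cost of one session: transcription, optional
    meeting minutes, and any generated images."""
    total = audio_minutes * COSTS["transcription_per_min"]
    total += COSTS["minutes_request"] if with_minutes else 0.0
    total += images * COSTS["image"]
    return round(total, 4)

# e.g. a 30-minute meeting with a summary and 2 generated images
print(session_cost(30, images=2))
```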
```
transcodio/
├── modal_app/
│   ├── app.py             # 6 Modal classes (STT, diarization, TTS, image gen, minutes, storage)
│   └── image.py           # Image generation helper
├── api/
│   ├── main.py            # FastAPI endpoints + auth, rate limiting, security headers
│   ├── models.py          # Pydantic response models
│   └── streaming.py       # SSE streaming utilities
├── static/
│   ├── index.html         # Web UI layout
│   ├── app.js             # Frontend logic + i18n translations
│   └── styles.css         # Styling
├── utils/
│   └── audio.py           # Audio validation pipeline
├── config.py              # Configuration constants
├── transcribe_file.py     # CLI transcription tool
├── requirements.txt       # Dependencies
├── CLAUDE.md              # Development guide
└── README.md
```
"Invalid or missing API key" (401) — Set the X-API-Key header. If in development, ensure TRANSCODIO_API_KEY is not set (authentication is disabled when empty).
"Rate limit exceeded" (429) — Wait and retry. Default limits: 5/min for transcription, 10/min for voice-clone/image, 30/min for other endpoints.
"Modal service unavailable" — Deploy the Modal app first:
py -m modal deploy modal_app/app.pySlow first request — Cold start takes 30-60s. GPU memory snapshots are enabled by default for 85-90% faster subsequent cold starts.
Meeting minutes not working — Check the Anthropic API key secret:
```bash
py -m modal secret list
py -m modal secret create anthropic-api-key ANTHROPIC_API_KEY=sk-ant-...
```

Image generation not working — Check the HuggingFace token secret:
```bash
py -m modal secret list
py -m modal secret create hf-token HF_TOKEN=hf_...
```

Audio validation errors — Ensure FFmpeg is installed:

```bash
ffmpeg -version
```

Voice cloning fails — Ensure the Modal app is deployed with all classes. Check that the reference audio is 3s–5min and in a supported format.
- NVIDIA NeMo — Parakeet TDT model
- NVIDIA TitaNet — Speaker embeddings
- Qwen3-TTS — Voice cloning model
- FLUX.1-schnell — Image generation model
- Anthropic Claude — Meeting minutes generation
- Modal — Serverless GPU infrastructure
- FastAPI — Web framework
MIT License — see LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
- Open an issue on GitHub
- Check the Modal documentation