Transcodio

AI-powered audio transcription, voice cloning, and image generation — all in one service

Transcodio is a production-ready platform deployed on Modal's serverless GPU infrastructure, with transcription powered by NVIDIA's Parakeet TDT 0.6B v3 model. It combines real-time streaming transcription, speaker diarization, AI meeting minutes, voice cloning with saved profiles, and text-to-image generation into a unified web application.

Features

Real-time Streaming Transcription: Progressive results via SSE with silence-based segmentation using NVIDIA Parakeet TDT 0.6B v3
Speaker Diarization: Automatic speaker identification using NVIDIA TitaNet embeddings + AgglomerativeClustering
Meeting Minutes: AI-powered summaries with action items using Anthropic Claude Haiku 4.5
Voice Cloning: Clone any voice with Qwen3-TTS — upload or record reference audio (up to 5 minutes), then synthesize up to 50,000 characters of text
Saved Voice Profiles: Persistently store voice profiles in Modal Volume for reuse without re-uploading
Image Generation: Text-to-image using FLUX.1-schnell with 4-step inference (~3-5 seconds per image)
Audio Playback: Integrated player to listen to uploaded audio alongside transcription
Multiple Formats: Supports MP3, WAV, M4A, FLAC, OGG, WebM, MP4
Subtitle Export: Download transcriptions as SRT/VTT with speaker labels
Cost Effective: ~$0.006 per minute of audio transcription on NVIDIA L4 GPUs
10 Languages for TTS: Spanish, English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Italian
Multilingual UI: English (default) and Spanish with one-click language toggle
Security Hardened: API key authentication, per-IP rate limiting, CSP headers, path traversal protection, XSS prevention

Quick Start

Prerequisites

Python 3.12+
uv — Fast Python package installer and runner
Modal account (free tier available)
FFmpeg installed locally
Anthropic API key (for meeting minutes)
HuggingFace token (for image generation — FLUX.1-schnell is a gated model)

Installation

Clone the repository:

git clone https://github.com/dorianlgs/transcodio.git
cd transcodio

Install dependencies:

uv sync

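Set up Modal authentication (the commands in this guide use py, the Windows Python launcher; on macOS or Linux, substitute python or python3):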

py -m modal setup

Create Modal secrets:

# Anthropic API key (required for meeting minutes)
py -m modal secret create anthropic-api-key ANTHROPIC_API_KEY=sk-ant-...

# HuggingFace token (required for image generation)
py -m modal secret create hf-token HF_TOKEN=hf_...

(Optional) Set API key for production:

# Set an API key to require authentication on all /api/* endpoints
# When unset, authentication is disabled (development mode)
export TRANSCODIO_API_KEY=your-secret-api-key

Deployment

Deploy the Modal backend:

py -m modal deploy modal_app/app.py

This deploys 6 Modal classes:

ParakeetSTTModel — GPU transcription (streaming + non-streaming)
SpeakerDiarizerModel — Speaker identification with TitaNet
MeetingMinutesGenerator — Claude Haiku 4.5 meeting summaries
Qwen3TTSVoiceCloner — Voice cloning and synthesis
VoiceStorage — Persistent voice profile management
FluxImageGenerator — Text-to-image generation

Start the FastAPI server:

uv run uvicorn api.main:app --reload

Open http://localhost:8000

Usage

Web Interface

The UI has three modes, with English as the default language. Click the ES/EN button in the header to switch between English and Spanish (preference is saved across sessions).

Transcription

Drag and drop an audio file or click to browse
Toggle optional features: Speaker Diarization, Meeting Minutes
Watch real-time transcription results appear segment by segment
View speaker-labeled segments and meeting minutes tabs
Copy text, download transcription (TXT), or export subtitles (SRT/VTT)
Listen to original audio with the integrated player

Voice Cloning

Choose a saved voice or create a new one
For new voices: upload reference audio (3s–5min) or record with your microphone
Enter the reference transcription and select the language
Enter target text to synthesize (up to 50,000 characters)
Click Generate — listen to and download the result
Optionally save the voice profile for reuse

Image Generation

Enter a text prompt (up to 500 characters)
Select dimensions (512x512, 768x768, or 1024x1024)
Click Generate — preview and download the image

CLI Tool

# Basic transcription
uv run transcribe_file.py audio.mp3

# Save to file
uv run transcribe_file.py audio.mp3 -o transcript.txt

# Non-streaming mode
uv run transcribe_file.py audio.mp3 --no-stream

# Process multiple files
uv run transcribe_file.py *.mp3

# All options
uv run transcribe_file.py --help

API Endpoints

Authentication: If TRANSCODIO_API_KEY is set, add -H "X-API-Key: YOUR_KEY" to every /api/* request.
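
For Python clients, the same header can be attached programmatically. The following is a minimal sketch using only the standard library; the helper name build_request is hypothetical, and reading the key from the TRANSCODIO_API_KEY environment variable on the client side is an assumption made for convenience.

```python
import os
import urllib.request
from typing import Optional


def build_request(url: str, api_key: Optional[str] = None) -> urllib.request.Request:
    """Create a request, attaching X-API-Key only when a key is available.

    Hypothetical helper: falls back to the TRANSCODIO_API_KEY environment
    variable when no key is passed explicitly.
    """
    key = api_key if api_key is not None else os.getenv("TRANSCODIO_API_KEY", "")
    # An empty key mirrors development mode, where the server skips authentication.
    headers = {"X-API-Key": key} if key else {}
    return urllib.request.Request(url, headers=headers)
```

Passing the result to urllib.request.urlopen sends the request; with an empty key, no authentication header is sent at all.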

Rate Limits (per IP): 5/min for transcription, 10/min each for voice cloning and image generation, 30/min for all other endpoints.
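
To illustrate what a per-IP limit such as 5/min means in practice, here is a small sliding-window sketch. It is not how slowapi implements limiting; the class name SlidingWindowLimiter and the explicit now argument are assumptions chosen to keep the example self-contained and deterministic.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Illustrative per-client limiter: at most max_requests per window."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # client IP -> timestamps of requests accepted inside the window
        self._hits = defaultdict(deque)

    def allow(self, client_ip: str, now: float) -> bool:
        hits = self._hits[client_ip]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # the server would answer 429 here
        hits.append(now)
        return True


# 5 requests per 60 seconds, matching the transcription limit.
limiter = SlidingWindowLimiter(max_requests=5, window_seconds=60)
```

Each client IP has its own window, so one busy client does not consume another client's quota.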

Transcription

POST /api/transcribe — Complete transcription (non-streaming)

curl -X POST "http://localhost:8000/api/transcribe" -F "file=@audio.mp3"

Example response:

{
  "text": "Complete transcription...",
  "language": "en",
  "duration": 45.2,
  "segments": [
    { "id": 0, "start": 0.0, "end": 3.5, "text": "First segment..." }
  ]
}
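
The segments array in this response carries enough timing information to build subtitles on the client. The sketch below converts segments of that shape to SRT text; it is an illustration rather than the project's own exporter, the function names are hypothetical, and speaker labels are omitted.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list) -> str:
    """Build SRT text from dicts with 'start', 'end', and 'text' keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):  # SRT cues are numbered from 1
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)  # a blank line separates consecutive cues


srt_text = segments_to_srt(
    [{"id": 0, "start": 0.0, "end": 3.5, "text": "First segment..."}]
)
```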

POST /api/transcribe/stream — Streaming transcription (SSE)

curl -X POST "http://localhost:8000/api/transcribe/stream" \
  -F "file=@audio.mp3" \
  -F "enable_diarization=true" \
  -F "enable_minutes=true"

SSE events: metadata → progress (per segment) → speakers_ready → minutes_ready → complete
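
A client consumes this stream by splitting it into events on blank lines. The parser below is a minimal sketch of that step; it assumes each event carries an event: line naming one of the types above and a JSON data: payload, and the sample payload fields are invented for illustration since the wire format is not documented here.

```python
import json


def parse_sse(stream_text: str):
    """Yield (event_name, payload) pairs from raw SSE text."""
    event, data_lines = "message", []
    for line in stream_text.splitlines():
        if not line:  # a blank line terminates one event
            if data_lines:
                yield event, json.loads("\n".join(data_lines))
            event, data_lines = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].lstrip())
    if data_lines:  # flush a final event that lacks a trailing blank line
        yield event, json.loads("\n".join(data_lines))


# Hypothetical sample; real payload fields may differ.
sample = (
    'event: metadata\ndata: {"duration": 45.2}\n\n'
    'event: progress\ndata: {"id": 0, "text": "First segment..."}\n\n'
    'event: complete\ndata: {}\n\n'
)
events = list(parse_sse(sample))
```

In a real client the text would arrive incrementally, so the same logic would run over lines read from the HTTP response rather than over one complete string.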

GET /api/audio/{session_id} — Retrieve cached audio for playback

Voice Cloning

POST /api/voice-clone — Clone a voice and synthesize text (single-shot)

curl -X POST "http://localhost:8000/api/voice-clone" \
  -F "ref_audio=@reference.wav" \
  -F "ref_text=Hello, this is my voice." \
  -F "target_text=Text to synthesize with the cloned voice." \
  -F "language=en"

GET /api/voices — List all saved voice profiles

POST /api/voices — Save a new voice profile

curl -X POST "http://localhost:8000/api/voices" \
  -F "name=My Voice" \
  -F "ref_audio=@reference.wav" \
  -F "ref_text=Reference transcription" \
  -F "language=en"

DELETE /api/voices/{voice_id} — Delete a saved voice

POST /api/synthesize — Synthesize text with a saved voice

curl -X POST "http://localhost:8000/api/synthesize" \
  -F "voice_id=abc123" \
  -F "target_text=Text to synthesize."

Image Generation

POST /api/generate-image — Generate image from text prompt

curl -X POST "http://localhost:8000/api/generate-image" \
  -F "prompt=a beautiful sunset over mountains" \
  -F "width=768" \
  -F "height=768"

GET /api/image/{session_id} — Retrieve generated image as PNG

Health

GET /health — Health check

Architecture

┌─────────────────────┐
│      Web UI         │
│  (HTML/JS/CSS)      │
│  3 modes:           │
│  Transcription      │
│  Voice Cloning      │
│  Image Generation   │
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│      FastAPI        │
│  API Key Auth       │
│  Rate Limiting      │
│  Security Headers   │
│  Upload/Validation  │
│  SSE Streaming      │
│  Session Caching    │
└─────────┬───────────┘
          │
          ▼
┌───────────────────────────────────────────────────┐
│              Modal Serverless GPUs                 │
│                                                   │
│  ┌──────────────┐  ┌───────────────────────────┐ │
│  │ Parakeet STT │  │ TitaNet Diarization (GPU) │ │
│  │ (L4 GPU)     │  └───────────────────────────┘ │
│  └──────────────┘                                 │
│  ┌──────────────┐  ┌───────────────────────────┐ │
│  │ Qwen3-TTS    │  │ FLUX.1-schnell (L4 GPU)   │ │
│  │ Voice Clone  │  │ Image Generation          │ │
│  │ (L4 GPU)     │  └───────────────────────────┘ │
│  └──────────────┘                                 │
│  ┌──────────────┐  ┌───────────────────────────┐ │
│  │ Claude Haiku │  │ Voice Storage             │ │
│  │ Minutes (CPU)│  │ (Modal Volume)            │ │
│  └──────────────┘  └───────────────────────────┘ │
└───────────────────────────────────────────────────┘

Security

Transcodio includes multiple layers of security hardening:

| Layer | Protection | Details |
| --- | --- | --- |
| Authentication | API key via `X-API-Key` header | Set `TRANSCODIO_API_KEY` env var; disabled when empty (dev mode) |
| Rate Limiting | Per-IP limits via `slowapi` | 5/min transcription, 10/min voice-clone & image, 30/min other |
| CSP | Content Security Policy | `default-src 'self'`; blocks inline scripts, external resources |
| Clickjacking | `X-Frame-Options: DENY` | Prevents embedding in iframes |
| MIME Sniffing | `X-Content-Type-Options: nosniff` | Forces declared content type |
| Path Traversal | UUID validation + path resolution | Blocks `../../` attacks on voice storage and cached files |
| XSS | DOM-safe rendering | Server data rendered via `textContent`/`escapeHtml()`, not raw `innerHTML` |
| Header Injection | Filename sanitization | Strips control chars; maps to safe MIME types |
| Error Hardening | Generic error messages | Internal details never exposed to clients |
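
The path traversal row combines two checks: the identifier must be a UUID, and the resolved path must stay inside the storage directory. The sketch below shows one way to express that pattern; the function name safe_voice_path and the directory layout are assumptions, not the project's actual code.

```python
import uuid
from pathlib import Path


def safe_voice_path(base_dir: Path, voice_id: str) -> Path:
    """Return the storage path for voice_id, rejecting anything suspicious."""
    # Check 1: accept only canonical UUID strings.
    try:
        parsed = uuid.UUID(voice_id)
    except ValueError as exc:
        raise ValueError("invalid voice id") from exc
    if str(parsed) != voice_id.lower():
        raise ValueError("invalid voice id")
    # Check 2: resolve the path and confirm it is still under the base directory.
    base = base_dir.resolve()
    candidate = (base / str(parsed)).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError("path escapes storage directory")
    return candidate
```

The second check is redundant once the UUID check passes, but keeping both means a later change to the identifier format does not silently reopen the hole.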

Configuration

Modal-specific settings are embedded in modal_app/app.py. API-layer settings are in config.py.

Key parameters:

import os

# Transcription
STT_MODEL_ID = "nvidia/parakeet-tdt-0.6b-v3"
SAMPLE_RATE = 16000
MODAL_GPU_TYPE = "L4"

# Silence Detection (streaming segmentation)
SILENCE_THRESHOLD_DB = -40
SILENCE_MIN_LENGTH_MS = 700

# Speaker Diarization
ENABLE_SPEAKER_DIARIZATION = True
DIARIZATION_MAX_SPEAKERS = 5

# Meeting Minutes
ENABLE_MEETING_MINUTES = True
ANTHROPIC_MODEL_ID = "claude-haiku-4-5-20251001"

# Voice Cloning
VOICE_CLONE_MAX_REF_DURATION = 300   # 5 minutes
VOICE_CLONE_MAX_TARGET_TEXT = 50000  # characters
MAX_SAVED_VOICES = 50

# Image Generation
ENABLE_IMAGE_GENERATION = True
IMAGE_NUM_INFERENCE_STEPS = 4

# File Limits
MAX_FILE_SIZE_MB = 100
MAX_DURATION_SECONDS = 3600  # 60 minutes

# Performance
MODAL_CONTAINER_IDLE_TIMEOUT = 120
ENABLE_GPU_MEMORY_SNAPSHOT = True  # 85-90% faster cold starts

# Security
API_KEY = os.getenv("TRANSCODIO_API_KEY", "")  # Empty = no auth (dev)
RATE_LIMIT_TRANSCRIBE = "5/minute"
RATE_LIMIT_VOICE_CLONE = "10/minute"
RATE_LIMIT_IMAGE = "10/minute"
RATE_LIMIT_DEFAULT = "30/minute"
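
To show how the two silence-detection parameters could interact, the sketch below measures each audio frame in dBFS and marks a split wherever a quiet run lasts at least SILENCE_MIN_LENGTH_MS. This is an illustration under stated assumptions (samples normalized to [-1.0, 1.0], fixed-length frames, hypothetical function names), not the segmentation code in modal_app/app.py.

```python
import math

SILENCE_THRESHOLD_DB = -40
SILENCE_MIN_LENGTH_MS = 700


def frame_dbfs(samples: list) -> float:
    """RMS level of one frame in dBFS, for samples normalized to [-1.0, 1.0]."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")


def find_split_points(frame_levels_db: list, frame_ms: int) -> list:
    """Return indices of the first voiced frame after each long-enough pause."""
    needed = math.ceil(SILENCE_MIN_LENGTH_MS / frame_ms)
    splits, run = [], 0
    for index, level in enumerate(frame_levels_db):
        if level < SILENCE_THRESHOLD_DB:
            run += 1  # still inside a quiet stretch
        else:
            if run >= needed:
                splits.append(index)  # pause was long enough to end a segment
            run = 0
    return splits
```

With 100 ms frames, seven consecutive frames below -40 dBFS are needed before a split is recorded; shorter pauses are treated as part of the same segment.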

Supported Formats

Audio: MP3, WAV, M4A, FLAC, OGG, WebM
Video: MP4 (audio track extracted)
Max file size: 100MB (transcription), 15MB (voice reference)
Max duration: 60 minutes (transcription), 5 minutes (voice reference)
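
These limits can be checked on the client before uploading a file for transcription. The following sketch validates extension, size, and duration only; the function name is hypothetical, treating 1 MB as 1024 × 1024 bytes is an assumption, and the server-side pipeline in utils/audio.py may apply further checks.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm", ".mp4"}
MAX_FILE_SIZE_MB = 100
MAX_DURATION_SECONDS = 3600  # 60 minutes


def validate_upload(filename: str, size_bytes: int, duration_seconds: float) -> list:
    """Return a list of problems; an empty list means the file looks acceptable."""
    errors = []
    if Path(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        errors.append("unsupported format")
    if size_bytes > MAX_FILE_SIZE_MB * 1024 * 1024:
        errors.append("file too large")
    if duration_seconds > MAX_DURATION_SECONDS:
        errors.append("audio too long")
    return errors
```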

Cost Analysis

| Feature | Cost |
| --- | --- |
| Transcription | ~$0.006 per minute of audio |
| Speaker Diarization | Included (same GPU) |
| Meeting Minutes | ~$0.001 per request (Haiku API) |
| Voice Cloning | ~$0.01-0.02 per synthesis |
| Image Generation | ~$0.01-0.02 per image |
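
As a worked example of these figures, the sketch below estimates the cost of one transcription job. The rates are the approximate values from the table, the function name is hypothetical, and actual billing depends on Modal and Anthropic pricing at the time of use.

```python
TRANSCRIPTION_COST_PER_MIN = 0.006  # approximate, from the table above
MINUTES_COST_PER_REQUEST = 0.001    # approximate, from the table above


def estimate_cost(audio_minutes: float, with_minutes: bool = False) -> float:
    """Rough cost in USD for transcribing audio, optionally with meeting minutes."""
    cost = audio_minutes * TRANSCRIPTION_COST_PER_MIN
    if with_minutes:
        cost += MINUTES_COST_PER_REQUEST  # one Haiku request per job
    return round(cost, 4)


# A 60-minute recording with meeting minutes: 60 * 0.006 + 0.001
hour_with_minutes = estimate_cost(60, with_minutes=True)
```

Diarization adds nothing to the estimate because the table lists it as included on the same GPU.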

Project Structure

transcodio/
├── modal_app/
│   ├── app.py             # 6 Modal classes (STT, diarization, TTS, image gen, minutes, storage)
│   └── image.py           # Image generation helper
├── api/
│   ├── main.py            # FastAPI endpoints + auth, rate limiting, security headers
│   ├── models.py          # Pydantic response models
│   └── streaming.py       # SSE streaming utilities
├── static/
│   ├── index.html         # Web UI layout
│   ├── app.js             # Frontend logic + i18n translations
│   └── styles.css         # Styling
├── utils/
│   └── audio.py           # Audio validation pipeline
├── config.py              # Configuration constants
├── transcribe_file.py     # CLI transcription tool
├── requirements.txt       # Dependencies
├── CLAUDE.md              # Development guide
└── README.md

Troubleshooting

"Invalid or missing API key" (401) — Set the X-API-Key header. If in development, ensure TRANSCODIO_API_KEY is not set (authentication is disabled when empty).

"Rate limit exceeded" (429) — Wait and retry. Default limits: 5/min for transcription, 10/min for voice-clone/image, 30/min for other endpoints.

"Modal service unavailable" — Deploy the Modal app first:

py -m modal deploy modal_app/app.py

Slow first request — Cold start takes 30-60s. GPU memory snapshots are enabled by default for 85-90% faster subsequent cold starts.

Meeting minutes not working — Check the Anthropic API key secret:

py -m modal secret list
py -m modal secret create anthropic-api-key ANTHROPIC_API_KEY=sk-ant-...

Image generation not working — Check the HuggingFace token secret:

py -m modal secret list
py -m modal secret create hf-token HF_TOKEN=hf_...

Audio validation errors — Ensure FFmpeg is installed:

ffmpeg -version

Voice cloning fails — Ensure Modal app is deployed with all classes. Check reference audio is 3s–5min and in a supported format.

Acknowledgments

NVIDIA NeMo — Parakeet TDT model
NVIDIA TitaNet — Speaker embeddings
Qwen3-TTS — Voice cloning model
FLUX.1-schnell — Image generation model
Anthropic Claude — Meeting minutes generation
Modal — Serverless GPU infrastructure
FastAPI — Web framework

License

MIT License — see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Support

Open an issue on GitHub
Check the Modal documentation
