OpenAI-compatible Text-to-Speech API
Features • Quick Start • API Reference • Deployment • Configuration
- **OpenAI-Compatible API** – Drop-in replacement for OpenAI's TTS API, no auth required
- **High Performance** – Dedicated thread pool, async synthesis, semaphore-based concurrency control
- **Multiple Formats** – MP3, WAV, FLAC, Opus, AAC, PCM via PyAV
- **Multiple Voices** – OpenAI voice names + native Supertonic styles + custom/mixed voice upload
- **Docker Ready** – Production containerization with nginx load balancer and persistent model cache
- **GPU Acceleration** – CUDA, CoreML, and Metal backends via ONNX Runtime
- **Smart Text Processing** – Unicode normalization, emoji removal, auto-chunking, pause tags
- **31 Languages** – Full multilingual support via supertonic-3
- Python 3.10+
- ONNX Runtime (CPU/CUDA/CoreML)
- Supertonic TTS library
```bash
# Docker (recommended)
git clone https://github.com/confused-ai/supertonic-api.git
cd supertonic-api

# Start API + nginx load balancer
docker compose up -d

# API available at http://localhost:8800
# Model downloads once and is cached in a Docker volume
```

```bash
# Local install
git clone https://github.com/confused-ai/supertonic-api.git
cd supertonic-api

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

uvicorn app.main:app --host 0.0.0.0 --port 8800
```

```bash
# Generate speech
curl -X POST "http://localhost:8800/v1/audio/speech" \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello from Supertonic!", "voice": "alloy", "response_format": "mp3"}' \
  --output speech.mp3
```

`POST /v1/audio/speech`
```bash
curl -X POST "http://localhost:8800/v1/audio/speech" \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "input": "Your text here...", "voice": "alloy", "response_format": "mp3", "speed": 1.0}' \
  --output output.mp3
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | string | `tts-1` | Accepted for OpenAI compatibility; the model is always supertonic-3 |
| `input` | string | (required) | Text to synthesize (1–4096 chars) |
| `voice` | string | `alloy` | Preset voice ID (see table below) or custom/mixed voice ID |
| `response_format` | string | `mp3` | `mp3`, `opus`, `aac`, `flac`, `wav`, `pcm` |
| `speed` | float | `1.0` | Speed multiplier (0.5–2.0) |
| `normalize` | boolean | `true` | Unicode normalization, emoji removal, punctuation fix |
| `lang` | string | `en` | BCP-47 language code (31 languages supported + `na`) |
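The constraints in the table above can be validated client-side before a request is sent. A hypothetical helper (`build_speech_payload` is illustrative, not part of this project):

```python
# Hypothetical client-side validator for the /v1/audio/speech payload.
# Mirrors the parameter table above; not part of the supertonic-api codebase.

VALID_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}

def build_speech_payload(text, voice="alloy", response_format="mp3", speed=1.0, lang="en"):
    if not 1 <= len(text) <= 4096:
        raise ValueError("input must be 1-4096 characters")
    if response_format not in VALID_FORMATS:
        raise ValueError(f"unsupported format: {response_format}")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    return {
        "model": "tts-1",  # accepted for compatibility; the server runs supertonic-3
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
        "lang": lang,
    }

payload = build_speech_payload("Hello from Supertonic!")
```

The returned dict can be posted as the JSON body of the curl example above.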
`GET /v1/models`

```bash
curl "http://localhost:8800/v1/models"
```

10 native Supertonic styles are exposed via 13 OpenAI-compatible voice IDs:
| Voice ID | Style | Character |
|---|---|---|
| `alloy` | F1 | Calm, clear female |
| `nova` | F2 | Bright, professional female |
| `shimmer` | F3 | Soft, expressive female |
| `ash` | F4 | Energetic, versatile female |
| `ballad` | F4 | Melodic, smooth female (shares style with ash) |
| `coral` | F5 | Airy, warm female |
| `marin` | F5 | Gentle, natural female (shares style with coral) |
| `echo` | M1 | Lively, upbeat male |
| `fable` | M2 | Warm, narrative male |
| `onyx` | M3 | Deep, authoritative male |
| `cedar` | M4 | Measured, resonant male |
| `sage` | M4 | Calm, steady male (shares style with cedar) |
| `verse` | M5 | Dynamic, dramatic male |
- `GET /v1/voices` – full voice list with types (preset / custom / mixed)
- `GET /voices` – legacy alias
```bash
curl "http://localhost:8800/v1/voices"
```

`POST /v1/voices/upload`

```bash
curl -X POST "http://localhost:8800/v1/voices/upload" \
  -F "file=@my_voice.json" \
  -F "name=my-voice"
```

`POST /v1/voices/mix`

```bash
curl -X POST "http://localhost:8800/v1/voices/mix" \
  -H "Content-Type: application/json" \
  -d '{"voice_a": "alloy", "voice_b": "echo", "weight": 0.5, "name": "alloy-echo"}'
```

`DELETE /v1/voices/{voice_id}`

```bash
curl -X DELETE "http://localhost:8800/v1/voices/mix:alloy-echo"
```

`GET /health`
```bash
curl "http://localhost:8800/health"
```

| Voice | Style | Description |
|---|---|---|
| `alloy` | F1 | Sarah – calm female |
| `echo` | M1 | Alex – lively, upbeat male |
| `fable` | F2 | Lily – bright, cheerful female |
| `onyx` | M2 | James – deep, robust male |
| `nova` | F3 | Jessica – professional announcer |
| `shimmer` | M3 | Robert – polished, authoritative male |
You can also use any native Supertonic style name directly (e.g. F4, M5) or a custom/mixed voice ID.
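The `weight` parameter of `/v1/voices/mix` presumably interpolates between the two source voices' embeddings. A minimal sketch of that idea, assuming linear blending (an assumption for illustration; the actual server-side mixing may differ):

```python
# Sketch of linear voice-embedding interpolation, assuming /v1/voices/mix
# computes weight * voice_a + (1 - weight) * voice_b per dimension.
# (An assumption for illustration; the real mixing code lives server-side.)

def mix_embeddings(emb_a, emb_b, weight=0.5):
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return [weight * a + (1 - weight) * b for a, b in zip(emb_a, emb_b)]

# An equal-weight blend of two toy 2-D embeddings
mixed = mix_embeddings([1.0, 0.0], [0.0, 1.0], weight=0.5)  # [0.5, 0.5]
```

Under this interpretation, `weight=1.0` reproduces `voice_a` exactly and `weight=0.0` reproduces `voice_b`.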
Environment variables can be set in a `.env` file:

```bash
# Server
HOST=0.0.0.0
PORT=8800
LOG_LEVEL=INFO

# Model Performance
MODEL_THREADS=12        # ONNX intra-op threads
MODEL_INTER_THREADS=4   # ONNX inter-op threads
MAX_WORKERS=8           # Concurrent synthesis workers + semaphore limit

# GPU Acceleration
FORCE_PROVIDERS=auto    # auto | cuda | coreml | metal | cpu

# Audio
SAMPLE_RATE=44100
MAX_CHUNK_LENGTH=300    # Max chars per synthesis chunk

# HuggingFace model cache (mounted as Docker volume)
HF_HOME=/root/.cache/huggingface
```

Set `FORCE_PROVIDERS` based on your hardware:
| Value | Description |
|---|---|
| `auto` | Auto-detect best available provider |
| `cuda` | NVIDIA GPU acceleration |
| `coreml` | Apple CoreML (M-series chips) |
| `metal` | Apple Metal (maps to CoreML) |
| `cpu` | CPU only |
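In `auto` mode the server presumably picks the best provider it finds at runtime. A standalone sketch of such selection logic, using ONNX Runtime's provider identifier strings (`pick_providers` is illustrative; the project's actual detection code may differ):

```python
# Sketch of FORCE_PROVIDERS resolution, assuming a preference order of
# CUDA > CoreML > CPU. Provider names are ONNX Runtime's identifiers;
# the project's real logic may differ.

PREFERENCE = ["CUDAExecutionProvider", "CoreMLExecutionProvider", "CPUExecutionProvider"]

def pick_providers(force: str, available: list[str]) -> list[str]:
    mapping = {
        "cuda": "CUDAExecutionProvider",
        "coreml": "CoreMLExecutionProvider",
        "metal": "CoreMLExecutionProvider",  # metal maps to CoreML, per the table above
        "cpu": "CPUExecutionProvider",
    }
    if force != "auto":
        return [mapping[force]]
    for provider in PREFERENCE:  # first available wins
        if provider in available:
            return [provider]
    return ["CPUExecutionProvider"]  # always-present fallback

# e.g. on a CUDA machine (with onnxruntime installed, the available list
# would come from onnxruntime.get_available_providers()):
providers = pick_providers("auto", ["CUDAExecutionProvider", "CPUExecutionProvider"])
```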
```bash
docker compose up -d --build
```

Services:
- **api** – FastAPI + uvicorn on port 8801 (internal)
- **lb** – nginx reverse proxy on port 8800 (public)
- **hf_cache** – named Docker volume; the model downloads once and is reused on every restart

To scale API workers:
```bash
docker compose up -d --scale api=2
```

- **Dedicated thread pool** – synthesis runs in an isolated `ThreadPoolExecutor` and never blocks the event loop
- **Thread-safe model init** – double-checked locking; the model loads once across all workers
- **Semaphore-bounded concurrency** – the `MAX_WORKERS` cap prevents memory exhaustion under load
- **PyAV streaming encoder** – chunks are encoded on the fly, with no full-audio buffering
- **Pre-compiled regex** – text normalization patterns are compiled at startup
- **Smart chunking** – long text is split at sentence/paragraph boundaries, preserving `[pause:N]` tags
```bash
pip install -r requirements.txt

# Dev server with auto-reload
uvicorn app.main:app --reload --port 8800

# Run all tests (unit + integration + eval)
python tests/run_all.py

# Unit tests only (no server needed)
python tests/run_all.py --unit-only

# With stress test
python tests/run_all.py --stress --concurrency 20 --requests 200

# Custom server
python tests/run_all.py --url http://localhost:8801
```

```
supertonic-api/
├── app/
│   ├── api/
│   │   ├── routes/           # Endpoint modules (speech, voices, models)
│   │   └── schemas.py        # Pydantic I/O models
│   ├── core/
│   │   ├── config.py         # pydantic-settings (.env)
│   │   ├── constants.py      # Model name
│   │   └── voices.py         # OpenAI → Supertonic voice map
│   ├── services/
│   │   ├── tts.py            # Singleton TTS service + async generation
│   │   ├── audio.py          # AudioNormalizer, AudioService
│   │   └── audio_encoder.py  # PyAV streaming encoder (mp3/wav/flac/opus/aac/pcm)
│   ├── utils/
│   │   └── text.py           # clean_text(), smart_split()
│   ├── inference/
│   │   └── base.py           # AudioChunk dataclass
│   └── main.py               # FastAPI app + lifespan
├── tests/
│   ├── run_all.py            # Unified test runner
│   └── output/               # Saved test audio files
├── Dockerfile
├── docker-compose.yml        # api + nginx lb + hf_cache volume
├── nginx.conf
└── requirements.txt
```
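The `smart_split()` helper in `app/utils/text.py` splits long input at sentence boundaries while keeping `[pause:N]` tags intact. A simplified sketch of that behavior (an approximation, not the real implementation):

```python
# Simplified sketch of sentence-boundary chunking that preserves [pause:N]
# tags, approximating smart_split() in app/utils/text.py. The project's
# actual splitter may handle paragraphs and edge cases differently.
import re

MAX_CHUNK_LENGTH = 300  # mirrors the MAX_CHUNK_LENGTH setting

def smart_split(text: str, max_len: int = MAX_CHUNK_LENGTH) -> list[str]:
    # Split on standalone [pause:N] tags (kept, via the capture group)
    # or on whitespace following sentence-ending punctuation (dropped).
    parts = re.split(r"(\[pause:\d+\])|(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for part in filter(None, parts):  # drop None/empty split artifacts
        if len(current) + len(part) + 1 > max_len and current:
            chunks.append(current.strip())
            current = ""
        current += " " + part if current else part
    if current:
        chunks.append(current.strip())
    return chunks
```

Because the tag is a capture group in the split pattern, it survives as its own element and is packed into a chunk like any other piece of text.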
Contributions welcome. See CONTRIBUTING.md.
- Fork → branch → commit → PR
- Run `python tests/run_all.py --unit-only` before submitting
- Supertonic - TTS engine
- FastAPI - Web framework
- PyAV - Audio encoding
Made with ❤️ by the community