# omnivoice-fastapi

Dockerized FastAPI wrapper for OmniVoice: zero-shot TTS for 600+ languages with voice cloning and voice design. Fully OpenAI-compatible speech endpoint.

## Features

- `/v1/audio/speech` – OpenAI-compatible drop-in replacement (works with Open WebUI, SillyTavern, etc.)
- `/v1/audio/clone` – zero-shot voice cloning from a reference audio file
- `/v1/audio/design` – voice design via natural-language attributes (e.g. `"female, low pitch, british accent"`)
- `/web` – built-in web UI with TTS, Clone (upload or record from microphone), and Design tabs
- `/docs` – Swagger UI with inline audio player for all endpoints
## Quick start

No clone required: run directly from Docker Hub.

GPU:

```bash
docker run --gpus all -p 8880:8880 \
  -v omnivoice_models:/app/models \
  --name omnivoice-fastapi \
  diogod2r/omnivoice-fastapi:latest
```

CPU:

```bash
docker run -p 8880:8880 \
  -v omnivoice_models:/app/models \
  -e DEVICE=cpu \
  --name omnivoice-fastapi-cpu \
  diogod2r/omnivoice-fastapi:cpu
```

The first run downloads the model (~4 GB) into the `omnivoice_models` volume; subsequent starts are instant.
The API is available at http://localhost:8880/docs.
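Because the first start blocks on the model download, a client may want to wait for `/health` before sending requests. A minimal stdlib sketch (assumes the default port 8880; the helper name is illustrative):

```python
import time
import urllib.error
import urllib.request

BASE = "http://localhost:8880"

def wait_until_ready(timeout_s: float = 600.0, poll_s: float = 5.0) -> bool:
    """Poll /health until the server answers (the first start may spend
    minutes downloading the ~4 GB model). Returns True once it is up."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{BASE}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, TimeoutError):
            pass  # server not up yet; keep polling
        time.sleep(poll_s)
    return False

if __name__ == "__main__":
    print("ready" if wait_until_ready() else "timed out")
```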
If you prefer Compose, clone the repo and run:

```bash
# GPU
git clone https://github.com/diogod2r/omnivoice-fastapi.git
cd omnivoice-fastapi/docker/gpu
docker compose up

# CPU
cd omnivoice-fastapi/docker/cpu
docker compose up
```

## Endpoints

| Path | Method | Description |
|---|---|---|
| `/v1/audio/speech` | POST (JSON) | OpenAI-compatible TTS |
| `/v1/audio/clone` | POST (multipart) | Voice cloning from reference audio |
| `/v1/audio/design` | POST (multipart) | Voice design via text attributes |
| `/v1/voices` | GET | List voice presets |
| `/v1/languages` | GET | Common language IDs |
| `/v1/models` | GET | List available models |
| `/health` | GET | Health check |
| `/web` | GET | Web UI |
| `/docs` | GET | Swagger UI (with inline audio player) |
## Usage with the OpenAI SDK

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8880/v1", api_key="not-needed")

with client.audio.speech.with_streaming_response.create(
    model="omnivoice",
    voice="female",                    # preset or any free-form instruct string
    input="Hello from OmniVoice!",
    extra_body={"language_id": "pt"},  # optional: force language
) as response:
    response.stream_to_file("output.wav")
```

## Voice presets

| ID | Description |
|---|---|
| `auto` | Model chooses automatically |
| `female` | Generic female |
| `male` | Generic male |
| `female_en` | Female, American accent |
| `male_en` | Male, American accent |
| `female_br` | Female, British accent |
| `male_br` | Male, British accent |
| `child` | Child voice |
| `elderly` | Elderly female |
| `whisper` | Whispering female |
You can also pass any free-form instruct string as the `voice` parameter:

```python
voice="male, low pitch, australian accent"
```
## Voice cloning

```bash
curl -X POST http://localhost:8880/v1/audio/clone \
  -F "text=Hello, I am speaking in your voice." \
  -F "ref_audio=@reference.wav" \
  -F "language_id=pt" \
  --output cloned.wav
```

Tip for Portuguese: pass `language_id=pt` to avoid a European accent when synthesizing Brazilian Portuguese.
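The same clone call can be made from Python without third-party dependencies. A sketch assuming the default port and the multipart fields shown in the curl example (`encode_multipart` and `clone_voice` are illustrative helpers, not part of the project):

```python
import uuid
import urllib.request
from pathlib import Path

BASE = "http://localhost:8880"

def encode_multipart(fields: dict, files: dict) -> tuple[bytes, str]:
    """Minimal multipart/form-data encoder (stdlib only).
    `fields` maps names to strings; `files` maps names to raw bytes."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    for name, blob in files.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"; '
            f'filename="{name}.wav"\r\nContent-Type: audio/wav\r\n\r\n'.encode()
            + blob + b"\r\n"
        )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

def clone_voice(text: str, ref_audio: str, language_id: str = "auto",
                out_path: str = "cloned.wav") -> None:
    """POST to /v1/audio/clone and save the returned audio."""
    body, content_type = encode_multipart(
        {"text": text, "language_id": language_id},
        {"ref_audio": Path(ref_audio).read_bytes()},
    )
    req = urllib.request.Request(
        f"{BASE}/v1/audio/clone", data=body,
        headers={"Content-Type": content_type},
    )
    with urllib.request.urlopen(req) as resp:
        Path(out_path).write_bytes(resp.read())

if __name__ == "__main__":
    clone_voice("Hello, I am speaking in your voice.", "reference.wav",
                language_id="pt")
```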
## Voice design

```bash
curl -X POST http://localhost:8880/v1/audio/design \
  -F "text=This voice was designed from scratch." \
  -F "instruct=female, low pitch, british accent" \
  --output designed.wav
```

## Generation parameters

All three endpoints expose the full OmniVoice `generate()` API:
| Parameter | Default | Description |
|---|---|---|
| `language_id` | `auto` | ISO language code (`pt`, `en`, `zh`, …); see `/v1/languages` |
| `speed` | `1.0` | Speed factor (ignored when `duration` is set) |
| `duration` | — | Fixed output duration in seconds (overrides `speed`) |
| `num_step` | `32` | Diffusion steps (lower = faster, higher = better quality) |
| `guidance_scale` | `2.0` | Classifier-free guidance scale |
| `denoise` | `true` | Prepend `<\|denoise\|>` token (speech only) |
| `t_shift` | `0.1` | Noise schedule time-step shift |
| `position_temperature` | `5.0` | Mask-position selection temperature |
| `class_temperature` | `0.0` | Token sampling temperature |
| `layer_penalty_factor` | `5.0` | Deeper codebook layer penalty |
| `preprocess_prompt` | `true` | Remove silences from reference audio (clone only) |
| `postprocess_output` | `true` | Remove silences from generated audio |
| `audio_chunk_duration` | `15.0` | Long-form chunk size (seconds) |
| `audio_chunk_threshold` | `30.0` | Long-form activation threshold (seconds) |
## Environment variables

| Variable | Default | Description |
|---|---|---|
| `MODEL_ID` | `k2-fsa/OmniVoice` | Hugging Face model ID |
| `DEVICE` | auto-detected | `cuda:0`, `cpu`, or `mps` |
| `HF_HOME` | `/app/models` | Hugging Face cache dir (mounted as a Docker volume) |
| `HF_ENDPOINT` | — | Mirror URL, e.g. `https://hf-mirror.com` |
## Project structure

```text
omnivoice-fastapi/
├── main.py                    # FastAPI app — all endpoints
├── index.html                 # Web UI (served at /web)
├── docker/
│   ├── gpu/
│   │   ├── Dockerfile         # python:3.10-slim + omnivoice (GPU)
│   │   └── docker-compose.yml # NVIDIA GPU deployment
│   └── cpu/
│       ├── Dockerfile         # python:3.11-slim + torch CPU
│       └── docker-compose.yml # CPU deployment
└── README.md
```
## License

Apache 2.0, same as OmniVoice.