Local-only TTS application with a FastAPI backend and a separate static frontend.
Balaka runs the speech model on your machine. There is no remote inference fallback, no Hugging Face Space dependency at request time, and no hidden service layer behind the API. The current default model is k2-fsa/OmniVoice.
- Local FastAPI API for speech synthesis
- Separate build-free frontend in `frontend/`
- English-first UI with optional Ukrainian interface localization
- Two working modes:
  - TTS with voice design
  - Voice cloning from a reference recording
- Runtime metadata endpoint for languages and voice attributes
- Single-venv setup based on Python `3.13`
- Tests for the public API surface
- Python `3.13`
- FastAPI
- PyTorch `2.8`
- `omnivoice==0.1.2`
- Plain HTML, CSS, and JavaScript
- macOS, Linux, or Windows with Python `3.13`
- One project virtualenv named `.venv`
- Enough RAM or VRAM for the selected TTS model
torch 2.8 is not available for Python 3.14, so this project intentionally targets Python 3.13.
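A small guard can catch a wrong interpreter before any dependencies are installed. The following is only a sketch: `check_python` and `REQUIRED` are hypothetical names and are not part of the project.

```python
import sys

# Interpreter version targeted by this project (major, minor).
REQUIRED = (3, 13)


def check_python(version_info=sys.version_info, required=REQUIRED):
    """Return True when the major.minor version equals `required`."""
    return tuple(version_info[:2]) == tuple(required)


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:2])
    status = "ok" if check_python() else "unsupported, recreate .venv with Python 3.13"
    print(f"Python {found}: {status}")
```

Running it with the interpreter inside `.venv` shows whether the environment was created with the expected Python version.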
Create a clean environment:

    rm -rf .venv
    python3.13 -m venv .venv
    source .venv/bin/activate
    python -m pip install --upgrade pip

Install runtime dependencies.

Apple Silicon:

    pip install torch==2.8.0 torchaudio==2.8.0
    pip install omnivoice==0.1.2
    pip install -e ".[dev]"

NVIDIA CUDA example:

    pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
    pip install omnivoice==0.1.2
    pip install -e ".[dev]"

Create local configuration:

    cp .env.example .env

Run the app:

    source .venv/bin/activate
    balaka

Open http://127.0.0.1:8000/tts/.

Alternative:

    python main.py

Convenience commands:

    make run
    make test
    make smoke

- Open `/tts/`
- Leave advanced settings collapsed
- Enter text
- Click `Generate speech`
- Listen in the built-in player or download the generated WAV
The interface defaults to English. Use the EN / UA toggle in the header to switch the frontend locale; the selection is stored in the browser.
The default frontend setup is tuned for a convenient out-of-the-box result:
- Language: `Ukrainian`
- Voice: `female`, `young adult`, `moderate pitch`
- Steps: `32`
- Guidance scale: `2.0`
- Speed: `1.0`
- Denoise: enabled
- Prompt preprocessing: enabled
- Output postprocessing: enabled
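For scripted use, the same defaults can be expressed as form fields. This is a sketch, not the frontend's actual code: the field names are taken from the `design` curl example in this README, and `DEFAULT_DESIGN_FORM` and `build_design_form` are hypothetical names.

```python
# Frontend defaults expressed as multipart form fields (all values are strings).
DEFAULT_DESIGN_FORM = {
    "language": "Ukrainian",
    "gender": "female",
    "age": "young adult",
    "pitch": "moderate pitch",
    "num_steps": "32",
    "guidance_scale": "2.0",
    "speed": "1.0",
    "denoise": "true",
    "preprocess_prompt": "true",
    "postprocess_output": "true",
}


def _to_form_value(value):
    """Render booleans as lowercase 'true'/'false'; everything else via str()."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def build_design_form(text, **overrides):
    """Return the default form with `text` set and selected fields overridden."""
    if not text.strip():
        raise ValueError("text must not be empty")
    unknown = set(overrides) - set(DEFAULT_DESIGN_FORM)
    if unknown:
        raise KeyError(f"unknown fields: {sorted(unknown)}")
    form = dict(DEFAULT_DESIGN_FORM)
    form.update({key: _to_form_value(val) for key, val in overrides.items()})
    form["text"] = text
    return form
```

For example, `build_design_form("Привіт", speed=1.2)` keeps every default except `speed`.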
- Set the project interpreter to `.venv/bin/python`
- If `.venv` was previously created with Python `3.14`, delete it and recreate it with Python `3.13`
- Run `main.py` or the `balaka` console command
Recommended `.env` values:

    BALAKA_DEBUG=false
    BALAKA_TTS_MODEL=k2-fsa/OmniVoice
    BALAKA_TTS_DEVICE=auto
    BALAKA_TTS_DTYPE=auto
    BALAKA_TTS_PRELOAD_RUNTIME=true
    BALAKA_TTS_LOAD_ASR=false

What each variable does:
| Variable | Default | Meaning |
|---|---|---|
| `BALAKA_DEBUG` | `false` | Runs the app without autoreload. This avoids reloading the large model on every source change. |
| `BALAKA_TTS_MODEL` | `k2-fsa/OmniVoice` | Hugging Face model ID or a local path to a compatible model directory. |
| `BALAKA_TTS_DEVICE` | `auto` | Picks `cuda`, then `mps`, then `cpu`, unless you force a specific device. |
| `BALAKA_TTS_DTYPE` | `auto` | Uses `float16` on GPU or MPS and `float32` on CPU. |
| `BALAKA_TTS_PRELOAD_RUNTIME` | `true` | Loads the model during startup so the first synthesis request is not delayed by model initialization. |
| `BALAKA_TTS_LOAD_ASR` | `false` | Keeps ASR disabled. Voice cloning in this project already requires explicit `reference_text`. |
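The `auto` rules for `BALAKA_TTS_DEVICE` and `BALAKA_TTS_DTYPE` can be sketched as two small functions. This is an illustration of the behavior described in the table, not the project's implementation; `resolve_device` and `resolve_dtype` are hypothetical names, and availability is passed in as plain booleans so the logic runs without PyTorch.

```python
def resolve_device(requested, cuda_available, mps_available):
    """Pick cuda, then mps, then cpu, unless a specific device is forced."""
    if requested != "auto":
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def resolve_dtype(requested, device):
    """Use float16 on GPU or MPS and float32 on CPU, unless a dtype is forced."""
    if requested != "auto":
        return requested
    return "float32" if device == "cpu" else "float16"


if __name__ == "__main__":
    device = resolve_device("auto", cuda_available=False, mps_available=True)
    print(device, resolve_dtype("auto", device))
```

With PyTorch installed, the two flags could come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.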
Optional:
- `HF_TOKEN` is not required for normal local use after the model is cached
- `HF_TOKEN` can still help on the first model download by raising Hugging Face rate limits
The first startup downloads the configured model into the local Hugging Face cache if it is not already present.
After that:
- repeated app starts reuse the local cache
- the backend should not fetch model files again unless the cache is missing
- model weights are still loaded into RAM or VRAM on every new process start, which is expected
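To check whether a model ID is already cached, one option is to look for its folder in the Hugging Face hub cache. The sketch below relies on the standard cache layout (`models--<org>--<name>` under the hub cache directory) and on the `HF_HUB_CACHE` and `HF_HOME` environment variables. It ignores `XDG_CACHE_HOME` for brevity, does not apply to local model paths, and its function names are hypothetical.

```python
import os
from pathlib import Path


def hub_cache_dir(env=None):
    """Resolve the hub cache directory from HF_HUB_CACHE, HF_HOME, or the default."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "hub"
    # Simplification: XDG_CACHE_HOME is not considered here.
    return Path.home() / ".cache" / "huggingface" / "hub"


def cached_model_dir(model_id, env=None):
    """Map a model ID such as 'k2-fsa/OmniVoice' to its cache folder."""
    return hub_cache_dir(env) / ("models--" + model_id.replace("/", "--"))


def is_cached(model_id, env=None):
    """Return True when at least one snapshot of the model exists locally."""
    snapshots = cached_model_dir(model_id, env) / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())


if __name__ == "__main__":
    model = os.environ.get("BALAKA_TTS_MODEL", "k2-fsa/OmniVoice")
    print(cached_model_dir(model), "cached" if is_cached(model) else "not cached")
```

A populated `snapshots` folder only shows that a download happened; it does not verify that every file of the model is present.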
Available endpoints:
- `GET /health`
- `GET /api/v1/tts/meta`
- `POST /api/v1/tts/design`
- `POST /api/v1/tts/clone`
Both synthesis endpoints return raw WAV bytes in the response body.
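Because the response body is a plain WAV file, it can be inspected with Python's standard `wave` module. The sketch below assumes PCM-encoded audio, which this README does not state explicitly. `describe_wav` and `make_silence` are hypothetical helpers, and `make_silence` exists only so the example runs without a server.

```python
import io
import wave


def describe_wav(data):
    """Return basic properties of WAV bytes, e.g. a synthesis response body."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "sample_width": wav.getsampwidth(),
            "duration_s": frames / rate,
        }


def make_silence(seconds=0.5, rate=16000):
    """Build a mono 16-bit silent WAV in memory as stand-in test data."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buffer.getvalue()


if __name__ == "__main__":
    print(describe_wav(make_silence()))
```

The same `describe_wav` call works on the contents of a saved response file, read with `open(path, "rb").read()`.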

    curl -X POST http://127.0.0.1:8000/api/v1/tts/design \
      -F 'text=Привіт. Це локальна перевірка синтезу.' \
      -F 'language=Ukrainian' \
      -F 'gender=female' \
      -F 'age=young adult' \
      -F 'pitch=moderate pitch' \
      -F 'style=Auto' \
      -F 'accent=Auto' \
      -F 'dialect=Auto' \
      -F 'speed=1.0' \
      -F 'num_steps=32' \
      -F 'guidance_scale=2.0' \
      -F 'denoise=true' \
      -F 'preprocess_prompt=true' \
      -F 'postprocess_output=true' \
      --output design.wav

    curl -X POST http://127.0.0.1:8000/api/v1/tts/clone \
      -F 'reference_audio=@voice.wav' \
      -F 'reference_text=Hello, this is the source voice.' \
      -F 'text=Привіт. Це клонований голос.' \
      -F 'language=Ukrainian' \
      -F 'speed=1.0' \
      -F 'num_steps=32' \
      -F 'guidance_scale=2.0' \
      -F 'denoise=true' \
      -F 'preprocess_prompt=true' \
      -F 'postprocess_output=true' \
      --output clone.wav

    .
    ├── frontend/                 # Static UI mounted by FastAPI
    ├── src/balaka/api/           # HTTP routes
    ├── src/balaka/core/          # Settings and voice metadata helpers
    ├── src/balaka/schemas/       # Pydantic request and response models
    ├── src/balaka/services/      # Local TTS runtime integration
    ├── tests/                    # API tests
    ├── main.py                   # Local entrypoint
    └── pyproject.toml

Run tests:

    pytest

Or use:

    make test
    make smoke

The frontend is intentionally build-free. Static files are served directly by FastAPI from `frontend/`.
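If dependency installation fails under Python 3.14, this is expected, because torch 2.8 is not available for it. Recreate `.venv` with Python 3.13.
A model download during startup is normal on the first run when the model is not cached yet.
A delay before the app starts serving requests is expected when `BALAKA_TTS_PRELOAD_RUNTIME=true`, because the app loads the model before serving requests.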
For voice cloning requests, make sure both of these are provided:
- `reference_audio`
- `reference_text`
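A client script can check for these two fields before sending a clone request. This is a minimal sketch with hypothetical names (`REQUIRED_CLONE_FIELDS`, `missing_clone_fields`). It only mirrors the requirement stated above and says nothing about how the backend itself validates input.

```python
# Fields this README names as mandatory for voice cloning.
REQUIRED_CLONE_FIELDS = ("reference_audio", "reference_text")


def missing_clone_fields(form):
    """Return the required clone fields that are absent or blank in `form`."""
    missing = []
    for name in REQUIRED_CLONE_FIELDS:
        value = form.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


if __name__ == "__main__":
    form = {"reference_audio": "voice.wav", "text": "Привіт. Це клонований голос."}
    print(missing_clone_fields(form))
```

An empty list means both fields are present and the request can be sent.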
- This project is local-only by design
- There is no remote Space fallback
- The backend keeps model-specific logic inside the runtime service instead of leaking it into the API layer