Create natural, multi-speaker conversations with IndexTTS2 - fast.
Turn scripts into realistic dialogue with speaker prep, line-by-line review, regeneration, timeline editing, and polished local export on top of the official IndexTTS2 models.
Bring your own voices. This repo does not ship bundled voice clones or private speaker files.
If you already have Docker and an NVIDIA-ready setup, the fastest path is:
- Put your models in `shared/models/checkpoints`
- Run `docker\start.bat`
- Open http://localhost:3000
- Start in Speaker Prep, then move through Conversation Workflow, Conversation Results, and Timeline Editor
Good entry points:
- Demo scenes: Hear What It Can Do
- Quick start: Quick Start
- Manual with screenshots: docs/manual/USER_MANUAL.md
- Docker details: docker/README.md
- Contributing: CONTRIBUTING.md
- Security reporting: SECURITY.md
If you want proof before setup details, start with three short scene demos rendered through this workflow:
| Demo | Play | What it shows |
|---|---|---|
| Podcast roundtable | Play in browser | Quick multi-speaker turn-taking and pacing |
| Audiobook night train | Play in browser | Steadier narration with longer phrasing |
| Game dialogue checkpoint breach | Play in browser | Tighter back-and-forth with more urgent timing |
These are public-safe sample scenes rendered from the local speaker library already used in the app.
```mermaid
flowchart LR
    A["Speaker Prep<br/>clean and score source clips"] --> B["Conversation Workflow<br/>generate first passes"]
    B --> C["Conversation Results<br/>compare takes and regenerate weak lines"]
    C --> D["Timeline Editor<br/>arrange timing, overlaps, and scenes"]
    D --> E["Export Mix<br/>render the final scene"]
```
For a video walkthrough, open the timeline workflow clip.
This repository is not the official IndexTTS model repository.
It is a practical local app built on top of the official IndexTTS ecosystem, with:
- Speaker Prep for cleaning and evaluating source clips
- Conversation Workflow for fast multi-speaker script generation
- Conversation Results for review, version selection, and regeneration
- Timeline Editor for scene timing, overlaps, and final arrangement
- a Docker-first local runtime with GPU-first behavior and CPU fallback
If you want the upstream project, papers, and hosted demos, use:
- Official code: index-tts/index-tts
- Official model: IndexTeam/IndexTTS-2
- Official demo: IndexTTS-2 Demo
- Paper: arXiv 2506.21619
This app is built for longer-form, practical TTS work rather than one-off single-line demos.
Example use cases:
- build a 3-speaker podcast or panel conversation
- create character dialogue for games, machinima, or visual novels
- generate narration with multiple takes, then compare and regenerate weak lines
- clean source clips before cloning so the voices sound more stable
- arrange interruptions, overlaps, and scene timing in a timeline before export
| What you want to do | Where you do it | Why it matters |
|---|---|---|
| Clean bad source audio before cloning | Speaker Prep | Better prompt clips usually mean better voice match, less noise, and fewer robotic takes |
| Turn a script into multi-speaker dialogue quickly | Conversation Workflow | Fastest path from script to first usable versions |
| Keep only the lines that actually sound right | Conversation Results | Review, compare, edit, regenerate, and lock final takes before export |
| Shape interruptions, timing, and full scenes | Timeline Editor | Useful when a plain linear conversation is not enough |
- upload or select raw source clips
- trim, convert to mono, normalize, and clean noisy audio
- run quick clone-readiness diagnostics
- save the improved result into the live speaker library
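The cleanup steps above can be sketched in miniature. The following is an illustrative toy in plain Python over raw float samples, not the app's actual DSP pipeline (which works on real audio files): it shows what stereo-to-mono downmix and peak normalization amount to.

```python
# Toy sketch of two Speaker Prep-style operations on float samples in [-1, 1].
# Assumption: real prep code operates on decoded audio buffers, not Python lists.

def to_mono(stereo):
    """Average left/right channel pairs into a single mono channel."""
    return [(left + right) / 2.0 for left, right in stereo]

def peak_normalize(samples, target_peak=0.95):
    """Scale samples so the loudest one sits at target_peak."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return samples  # silent clip: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]

mono = to_mono([(0.2, 0.4), (-0.5, -0.1)])
clean = peak_normalize(mono)
```

Normalizing the prompt clip this way keeps its loudest peak just under full scale, which is one reason cleaned clips tend to clone more consistently than raw recordings.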
- paste a multi-speaker script
- see available voices clearly
- apply pacing presets
- parse, generate, and save project state
- compare versions line by line
- play clips, compare clips, and review scores
- edit text during review
- regenerate weak lines
- export only after every line has a chosen final take
- build a scene directly in the timeline
- add speaker tracks and segments
- move segments in time
- shape overlaps and interruptions
- preview and export the final arranged scene
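The timeline operations above reduce to simple interval bookkeeping. Here is a hypothetical sketch of that model, not the editor's real data structures: each segment has a speaker, a start time, and a duration, and two segments overlap (an interruption) when their time ranges intersect.

```python
# Hypothetical timeline-segment model; names and fields are illustrative only.
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str
    start: float      # seconds from scene start
    duration: float   # seconds

    @property
    def end(self) -> float:
        return self.start + self.duration

def overlaps(a: Segment, b: Segment) -> bool:
    """True when the two segments intersect in time."""
    return a.start < b.end and b.start < a.end

line1 = Segment("SpeakerOne", start=0.0, duration=2.5)
line2 = Segment("SpeakerTwo", start=2.0, duration=1.5)  # cuts in before line1 ends
```

Sliding `line2.start` past `line1.end` would turn the interruption back into a plain linear exchange, which is exactly the kind of adjustment the Timeline Editor is for.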
| Speaker Prep | Conversation Workflow |
|---|---|
| ![]() | ![]() |

| Conversation Results | Timeline Editor |
|---|---|
| ![]() | ![]() |
This is the supported runtime path for this repo.
Default behavior:
- use your NVIDIA GPU when Docker can access it
- fall back to CPU only when GPU runtime is unavailable
- Put your model files in `shared/models/checkpoints`
- Start the app:

  `docker\start.bat`

  Or manually:

  ```
  docker compose -f docker/docker-compose.yml -f docker/docker-compose.gpu.yml up -d --build
  ```
- Open:
- Frontend UI: http://localhost:3000
- Backend API: http://localhost:8001
- API docs: http://localhost:8001/docs
- Add your own clips:
  - finished cloning prompts -> `shared/audio/speakers/`
  - raw clips for prep -> `shared/audio/source_clips/`
- Work through the app in this order:
  Speaker Prep -> Conversation Workflow -> Conversation Results -> Timeline Editor
To stop the stack:
`docker\stop.bat`

If `shared/models/checkpoints` is empty, the backend can automatically download the official IndexTTS2 model bundle on first start.
If you prefer to download the models yourself instead of using the app's automatic bootstrap, use the official upstream links:
- IndexTTS-2 on Hugging Face: IndexTeam/IndexTTS-2
- IndexTTS-2 on ModelScope: IndexTTS-2
For this app, place the downloaded files in `shared/models/checkpoints`.
The app is built around IndexTTS2. Older upstream model releases exist, but this repo's current workflow and docs are centered on the v2 model line.
- Docker-first startup with GPU-first behavior and CPU fallback
- DeepSpeed-enabled local inference path
- Speaker prep and diagnostics before cloning
- Multi-speaker script generation
- Review, regeneration, and final selection gating
- Timeline-based arrangement and scene export
- Project save/load for longer sessions
- Public demo scenes, screenshots, and short walkthrough videos
- Optional WebMCP website bridge for AI agents
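The project save/load feature listed above boils down to serializing conversation state to disk and restoring it later. A hypothetical illustration follows; the app's real save format (stored under `shared/data/project_saves/`) is not specified here, so the field names are assumptions.

```python
# Hypothetical project-state round trip; "script" and "chosen_takes" are
# illustrative field names, not the app's actual save schema.
import json

state = {
    "script": [["SpeakerOne", "First line."], ["SpeakerTwo", "Second line."]],
    "chosen_takes": {"0": "take_2", "1": "take_1"},  # final take per line index
}

saved = json.dumps(state)       # what gets written to a project save file
restored = json.loads(saved)    # what load reads back for a longer session
```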
- Start with the pinned support issue for common first-run help and runtime fixes
- Use the built-in GitHub issue templates for bug reports, setup help, and feature requests
- Read CONTRIBUTING.md if you want to send changes back
- Read SECURITY.md for responsible vulnerability reporting
The web UI remains the main interface for human users.
For agent-style use, the frontend now includes an optional WebMCP bridge that can expose studio tools, prompts, and resources from the running website when the WebMCP script is available.
This is intended as a lightweight AI integration layer on top of the existing web app and API, not as a replacement for the normal UI.
Practical guidance for a good local experience:
- NVIDIA GPU is strongly recommended
- CPU fallback works, but startup and generation are much slower
- at least `16 GB` RAM, `32 GB` recommended
- allow `50 GB+` disk space for models, caches, and outputs
- the first DeepSpeed-enabled startup can take longer while extensions warm up or compile
Today the Docker image is NVIDIA/CUDA-based. AMD and Apple GPU paths are not supported by this Docker image.
More runtime details: docker/README.md
This app is intentionally BYO voice.
It does not include celebrity voices, personal voice libraries, or redistributable speaker packs.
Expected folders:
- `shared/audio/speakers/` - live speaker prompt files used by the app
- `shared/audio/source_clips/` - raw clips for cleanup and preparation
- `shared/audio/speakers_backups/` - backups of original speaker files before replacement
If a voice sounds too fast, robotic, or less faithful than expected:
- use the `Clone Fidelity` preset in the UI
- keep random sampling off
- use natural punctuation and sentence casing in the script
- prefer a clean `8 to 20 second` clip with one speaker and low background noise
- use Speaker Prep before blaming the model
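A quick local sanity check for the clip-length and channel-count advice above can be done with Python's standard `wave` module. This is a workflow suggestion, not an app feature; the in-memory test clip below is synthetic.

```python
# Check that a WAV clip is mono and within the suggested 8-20 second range
# before using it as a cloning prompt. Uses only the stdlib wave module.
import io
import wave

def clip_report(wav_bytes: bytes) -> dict:
    """Report channel count and duration for an in-memory WAV clip."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        seconds = w.getnframes() / w.getframerate()
        return {
            "mono": w.getnchannels() == 1,
            "seconds": round(seconds, 2),
            "in_range": 8.0 <= seconds <= 20.0,
        }

# Build a 10-second silent mono clip at 16 kHz just to exercise the check.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000 * 10)

report = clip_report(buf.getvalue())
```

For a real clip, read the file bytes with `open(path, "rb").read()` and pass them to `clip_report`; a stereo or out-of-range result is a hint to run the clip through Speaker Prep first.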
There is also:
- a benchmark helper in backend/scripts/quality_benchmark.py
- a listening review format in docs/research/LISTENING_FEEDBACK_SYNTAX.md
- a scripting guide in docs/research/INDEXTTS2_SCRIPTING_PLAYBOOK.md
If you want a guided tour of the app before using it, start here:
- Full user manual with screenshots: docs/manual/USER_MANUAL.md
- Speaker Prep video: docs/assets/manual/videos/speaker-prep-tab.webm
- Conversation Workflow video: docs/assets/manual/videos/conversation-workflow-tab.webm
- Conversation Results video: docs/assets/manual/videos/conversation-results-tab.webm
- Timeline Editor video: docs/assets/manual/videos/timeline-editor-tab.webm
```
SpeakerOne: I think we should test three versions before we keep the final line.
SpeakerTwo: Good. If one sounds rushed, regenerate it and compare again.
SpeakerThree: After that, move the best takes into the timeline and export the scene.
```
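The script format above is plain `Speaker: line` text. A minimal sketch of how such a script splits into (speaker, dialogue) pairs is shown below; the app's real parser surely handles more edge cases (blank lines, stage directions, malformed names) than this toy does.

```python
# Toy parser for the "Speaker: line" script format; illustrative only.
def parse_script(text: str) -> list[tuple[str, str]]:
    lines = []
    for raw in text.strip().splitlines():
        if ":" not in raw:
            continue  # skip anything that is not a dialogue line
        speaker, _, dialogue = raw.partition(":")
        lines.append((speaker.strip(), dialogue.strip()))
    return lines

script = """\
SpeakerOne: I think we should test three versions before we keep the final line.
SpeakerTwo: Good. If one sounds rushed, regenerate it and compare again.
"""
parsed = parse_script(script)
```

Each resulting pair maps a speaker name onto a line of dialogue, which is the shape the generation step needs to assign voices per line.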
- `shared/audio/speakers/` - live speaker reference audio used for cloning
- `shared/audio/source_clips/` - raw clips for preparation or batch processing
- `shared/audio/speakers_backups/` - backups of original speaker files
- `shared/models/checkpoints/` - IndexTTS model files
- `shared/audio/outputs/` - exported outputs
- `shared/audio/temp_conversation_segments/` - per-line conversation audio
- `shared/audio/uploads/` - temporary imported files
- `shared/data/project_saves/` - saved conversation projects
- `shared/data/timeline_projects/` - saved timeline projects
- `frontend/` - browser UI
- `backend/` - FastAPI app plus wrapped IndexTTS runtime
- `docs/` - manuals, research notes, release docs, and supporting references
- `tools/` - maintenance helpers, manual capture scripts, and debug utilities
- `examples/` - reusable sample inputs and saved examples
If you want a CLI-style run without teaching users a host Python setup, use the backend container:
```
docker compose -f docker/docker-compose.yml exec backend python backend/indextts/cli.py "Your text here" -v /app/shared/audio/speakers/YourVoice.wav -o output.wav --model_dir /app/shared/models/checkpoints -c /app/shared/models/checkpoints/config.yaml
```

- Docs index: docs/README.md
- User manual: docs/manual/USER_MANUAL.md
- Docker guide: docker/README.md
- Deployment guide: docs/deployment/DEPLOYMENT_GUIDE.md
- API summary: docs/api/API_README.md
- Known limitations: docs/project/KNOWN_LIMITATIONS.md
- Release readiness: docs/project/RELEASE_READINESS_STATUS.md
- Audio folder guide: shared/audio/README.md
If you want the shortest path from install to first result:
- Start the stack with `docker\start.bat`
- Open http://localhost:3000
- Prep or add a voice
- Generate a short conversation
- Review, regenerate, and export
The underlying model technology, papers, and official pretrained checkpoints belong to the IndexTTS team. This repository packages those models into a more workflow-focused local application.



