feat(tts): integrate Gemini API text-to-speech by RayJiang4S · Pull Request #85 · calesthio/OpenMontage

RayJiang4S · 2026-05-22T05:08:15Z

Summary

Replace the Google TTS provider with Gemini API text-to-speech (gemini-3.1-flash-tts-preview)
Support API-key auth via GEMINI_API_KEY / GOOGLE_API_KEY
Add prompt-directed single-speaker and two-speaker payload support, writing Gemini PCM output as WAV
Update provider docs, environment examples, and contract coverage for the Gemini API workflow

Why

Google's Gemini API TTS docs show gemini-3.1-flash-tts-preview as a supported TTS model with responseModalities: [\"AUDIO\"], prebuilt voice config, 24 kHz PCM output, and single/multi-speaker support. This keeps OpenMontage aligned with the current Google TTS path while preserving the existing provider contract and registry discovery style.

Verification

Real smoke test passed with gemini-3.1-flash-tts-preview / Kore, producing a 24 kHz mono WAV (5.24s)
.venv/bin/python -m pytest tests/contracts/test_phase3_contracts.py -q -> 70 passed
.venv/bin/python -m compileall tools/audio/google_tts.py tests/contracts/test_phase3_contracts.py -> passed
git diff --check -> passed

Replace the Google TTS provider with the latest Gemini API TTS path, including prompt-directed single-speaker and two-speaker audio generation. Update provider docs, environment examples, and contract coverage for the new API-key based workflow.

Expose reusable Google TTS delivery presets and duration target guidance so narration auditions can carry stable style and pacing instructions into Gemini prompt-directed synthesis.

RayJiang4S · 2026-05-22T17:04:25Z

Updated this PR with the Google TTS workflow learnings from a real narration audition pass.

What changed:

Added delivery_preset support to google_tts for reusable Gemini prompt-directed narration styles:
- technical_briefing
- compact_explainer
- warm_opening
- clear_cta
Added duration_target_seconds as explicit prompt guidance for approximate timing. This does not pretend Gemini exposes a hard speed parameter; it records timing intent in the synthesis prompt and output metadata.
Returned delivery_preset and duration_target_seconds in provider output metadata for auditability.
Documented the recommended preset + duration workflow in docs/PROVIDERS.md.
Added contract coverage for prompt generation and invalid preset/duration validation.

Validation:

.venv/bin/python -m pytest tests/contracts/test_phase3_contracts.py -q passed: 73 tests.
.venv/bin/python -m py_compile tools/audio/google_tts.py tests/contracts/test_phase3_contracts.py passed.
git diff --check passed.

RayJiang4S requested a review from calesthio as a code owner May 22, 2026 05:08

feat(tts): add Gemini delivery presets

4dddd8a

Expose reusable Google TTS delivery presets and duration target guidance so narration auditions can carry stable style and pacing instructions into Gemini prompt-directed synthesis.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(tts): integrate Gemini API text-to-speech#85

feat(tts): integrate Gemini API text-to-speech#85
RayJiang4S wants to merge 2 commits into
calesthio:mainfrom
RayJiang4S:ray/google-tts-provider-upstream

RayJiang4S commented May 22, 2026

Uh oh!

RayJiang4S commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

RayJiang4S commented May 22, 2026

Summary

Why

Verification

Uh oh!

RayJiang4S commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant