feat(tts): integrate Gemini API text-to-speech #85

Open
RayJiang4S wants to merge 2 commits into calesthio:main from RayJiang4S:ray/google-tts-provider-upstream

@RayJiang4S

Summary

  • Replace the Google TTS provider with Gemini API text-to-speech (gemini-3.1-flash-tts-preview)
  • Support API-key auth via GEMINI_API_KEY / GOOGLE_API_KEY
  • Add prompt-directed single-speaker and two-speaker payload support, writing Gemini PCM output as WAV
  • Update provider docs, environment examples, and contract coverage for the Gemini API workflow
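
The API-key lookup and the PCM-to-WAV step in the summary above can be sketched with the standard library alone. This is a minimal illustration, not the code in tools/audio/google_tts.py: the function names are hypothetical, checking GEMINI_API_KEY before GOOGLE_API_KEY is an assumed precedence, and the 16-bit sample width is an assumption about the PCM format (the PR itself only states 24 kHz mono).

```python
import io
import os
import wave
from typing import Mapping, Optional


def resolve_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the first non-empty key (hypothetical helper).

    Assumption: GEMINI_API_KEY takes precedence over GOOGLE_API_KEY.
    """
    env = os.environ if env is None else env
    for name in ("GEMINI_API_KEY", "GOOGLE_API_KEY"):
        value = env.get(name, "").strip()
        if value:
            return value
    raise RuntimeError("Set GEMINI_API_KEY or GOOGLE_API_KEY")


def pcm_to_wav_bytes(
    pcm: bytes,
    sample_rate: int = 24000,  # 24 kHz, as stated in the PR
    channels: int = 1,         # mono, as stated in the PR
    sample_width: int = 2,     # assumption: 16-bit samples
) -> bytes:
    """Wrap raw PCM frames in a WAV container and return the file bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    # wave only closes files it opened itself, so buf is still readable here.
    return buf.getvalue()
```

Writing to an in-memory buffer keeps the helper easy to test; a caller would write the returned bytes to the output path.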

Why

Google's Gemini API TTS docs show gemini-3.1-flash-tts-preview as a supported TTS model with responseModalities: ["AUDIO"], prebuilt voice config, 24 kHz PCM output, and single/multi-speaker support. This keeps OpenMontage aligned with the current Google TTS path while preserving the existing provider contract and registry discovery style.
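
For readers unfamiliar with the request shape, the sketch below builds a generateContent-style JSON payload for the single-speaker and two-speaker cases. It is an illustration under assumptions: build_tts_payload is a hypothetical helper, and the nested field names (speechConfig, voiceConfig, prebuiltVoiceConfig, multiSpeakerVoiceConfig, speakerVoiceConfigs) follow the Gemini REST docs as best recalled, so verify them against the current reference before relying on them.

```python
from typing import Dict, Optional

MODEL = "gemini-3.1-flash-tts-preview"  # model name taken from the PR description


def _voice(name: str) -> dict:
    """Prebuilt voice block; field names assumed from the Gemini REST docs."""
    return {"prebuiltVoiceConfig": {"voiceName": name}}


def build_tts_payload(
    text: str,
    voice: str = "Kore",
    speakers: Optional[Dict[str, str]] = None,
) -> dict:
    """Build a request body (hypothetical helper).

    speakers maps a speaker label used in the text to a prebuilt voice name.
    When it is given, the single-speaker voice argument is ignored.
    """
    if speakers:
        if len(speakers) != 2:
            raise ValueError("two-speaker mode expects exactly two speakers")
        speech_config = {
            "multiSpeakerVoiceConfig": {
                "speakerVoiceConfigs": [
                    {"speaker": label, "voiceConfig": _voice(name)}
                    for label, name in speakers.items()
                ]
            }
        }
    else:
        speech_config = {"voiceConfig": _voice(voice)}
    return {
        "contents": [{"parts": [{"text": text}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": speech_config,
        },
    }
```

A two-speaker call might look like build_tts_payload("Host: Hi.\nGuest: Hello.", speakers={"Host": "Kore", "Guest": "Puck"}), where the voice names are examples only.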

Verification

  • Real smoke test passed with gemini-3.1-flash-tts-preview / Kore, producing a 24 kHz mono WAV (5.24s)
  • .venv/bin/python -m pytest tests/contracts/test_phase3_contracts.py -q -> 70 passed
  • .venv/bin/python -m compileall tools/audio/google_tts.py tests/contracts/test_phase3_contracts.py -> passed
  • git diff --check -> passed

Replace the Google TTS provider with the latest Gemini API TTS path, including prompt-directed single-speaker and two-speaker audio generation. Update provider docs, environment examples, and contract coverage for the new API-key based workflow.
RayJiang4S requested a review from calesthio as a code owner on May 22, 2026, 05:08
Expose reusable Google TTS delivery presets and duration target guidance so narration auditions can carry stable style and pacing instructions into Gemini prompt-directed synthesis.
@RayJiang4S

Author

Updated this PR with the Google TTS workflow learnings from a real narration audition pass.

What changed:

  • Added delivery_preset support to google_tts for reusable Gemini prompt-directed narration styles:
    • technical_briefing
    • compact_explainer
    • warm_opening
    • clear_cta
  • Added duration_target_seconds as explicit prompt guidance for approximate timing. Gemini does not expose a hard speed parameter, so this only records the timing intent in the synthesis prompt and in the output metadata.
  • Returned delivery_preset and duration_target_seconds in provider output metadata for auditability.
  • Documented the recommended preset + duration workflow in docs/PROVIDERS.md.
  • Added contract coverage for prompt generation and invalid preset/duration validation.
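
One way the preset and duration guidance described above could be folded into a prompt is sketched below. The four preset names come from this PR; the direction text attached to each name, the build_prompt helper, and the exact wording of the timing hint are hypothetical and are not the strings used in google_tts.

```python
from typing import Optional

# Preset names are from the PR; the direction text is illustrative only.
DELIVERY_PRESETS = {
    "technical_briefing": "Speak in a precise, measured, confident tone.",
    "compact_explainer": "Speak clearly and briskly, with short pauses.",
    "warm_opening": "Speak in a warm, welcoming tone.",
    "clear_cta": "Speak with clear emphasis on the call to action.",
}


def build_prompt(
    text: str,
    delivery_preset: Optional[str] = None,
    duration_target_seconds: Optional[float] = None,
) -> str:
    """Prefix the narration text with style and timing directions."""
    directions = []
    if delivery_preset is not None:
        if delivery_preset not in DELIVERY_PRESETS:
            raise ValueError(f"unknown delivery_preset: {delivery_preset!r}")
        directions.append(DELIVERY_PRESETS[delivery_preset])
    if duration_target_seconds is not None:
        valid = (
            isinstance(duration_target_seconds, (int, float))
            and not isinstance(duration_target_seconds, bool)
            and duration_target_seconds > 0
        )
        if not valid:
            raise ValueError("duration_target_seconds must be a positive number")
        # Guidance only: the model may not hit this length exactly.
        directions.append(
            f"Aim for roughly {duration_target_seconds:g} seconds of speech."
        )
    if not directions:
        return text
    return " ".join(directions) + "\n\n" + text
```

Because the duration is only a hint in the prompt, a caller that needs tight timing would still measure the returned audio and re-run or trim as needed.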

Validation:

  • .venv/bin/python -m pytest tests/contracts/test_phase3_contracts.py -q passed: 73 tests.
  • .venv/bin/python -m py_compile tools/audio/google_tts.py tests/contracts/test_phase3_contracts.py passed.
  • git diff --check passed.
