
Add PCM voice pipeline: server-side STT/TTS, Smart Turn, web UI voice mode #971

Closed
pbranchu wants to merge 24 commits into RightNow-AI:main from pbranchu:feat/voice-pcm
Conversation


pbranchu commented Apr 3, 2026

⚠️ Dependencies — do not merge until both are on main

This branch was built by merging both feat/voice-adapter and feat/channel-system-prompts, then adding the PCM commits on top. Once #798 and #876 land on main, this branch will be rebased and marked ready for review.

Summary

Builds on the text-based voice adapter (#798) to add a full server-side PCM pipeline:

  • PCM mode: accept raw 16 kHz mono PCM audio from the client over the existing WebSocket; no client-side STT/TTS required
  • STT: Deepgram nova-3 via REST API (stt.rs)
  • TTS: Cartesia sonic-2 with sentence-by-sentence streaming (tts.rs)
  • Smart Turn v3.2: ONNX-based end-of-turn detection with silence-detection fallback if libonnxruntime is unavailable (smart_turn.rs)
  • Voice WebSocket on main port: /voice route merged into the main API router — no separate port required, single reverse proxy covers everything
  • Web UI voice mode: phone icon in chat header to start/end a call; live status bar (listening/thinking/speaking); dictation mic remains in compose toolbar
  • Permissions-Policy header: microphone=* added to webchat response so getUserMedia works on HTTPS

Voice channel system prompt (requires #876)

Once #876 is merged, configure the voice channel system prompt:

```toml
[channels.voice.system_prompt]
prepend = """
You are responding via voice. Keep answers short and conversational.
No bullet points, no markdown, no code blocks, no lists.
User messages may be prefixed with [From: Name] to identify the speaker.
Do NOT include this prefix in your own responses.
"""
```

Test plan

  • `cargo build --workspace` compiles cleanly
  • `cargo clippy --workspace -- -D warnings` passes
  • `cargo fmt --check` passes
  • Connect to `ws://localhost:4200/voice`, send PCM audio, verify STT transcription and TTS response
  • Voice WebSocket accessible at `/voice` on the main API port (no separate port needed)
  • Web UI: phone icon in header starts voice mode; mic in compose toolbar for dictation
  • `getUserMedia` succeeds on HTTPS without `SecurityError`

🤖 Generated with Claude Code

Philippe Branchu and others added 13 commits April 4, 2026 11:27
Introduces a new voice channel adapter that provides a WebSocket server
for voice clients (mobile apps, Meet bots, web clients). Clients handle
STT/TTS directly; this adapter exchanges text with emotion tag parsing
and sentence-by-sentence delivery.

- New VoiceAdapter in openfang-channels with WebSocket protocol
- VoiceConfig in openfang-types with listen address and default agent
- Channel bridge registration and override wiring in openfang-api
- Emotion tag parsing (amused, concerned, formal, warm, etc.)
- Abbreviation-aware sentence splitting for natural TTS delivery
- 10 unit tests for parsing and adapter creation
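The abbreviation-aware splitting mentioned above could look roughly like the following sketch. This is an illustration, not the adapter's actual code: the function name, the abbreviation list, and the splitting rules are assumptions.

```rust
/// Hypothetical sketch of abbreviation-aware sentence splitting for
/// sentence-by-sentence TTS delivery. Splits on '.', '!', '?' followed by
/// whitespace (or end of text), but not after common abbreviations.
fn split_sentences(text: &str) -> Vec<String> {
    const ABBREVIATIONS: &[&str] = &["Dr", "Mr", "Mrs", "Ms", "St", "e.g", "i.e", "etc", "vs"];
    let chars: Vec<char> = text.chars().collect();
    let mut sentences = Vec::new();
    let mut start = 0;
    for (i, &c) in chars.iter().enumerate() {
        let boundary = matches!(c, '.' | '!' | '?')
            && chars.get(i + 1).map_or(true, |n| n.is_whitespace());
        if !boundary {
            continue;
        }
        let so_far: String = chars[start..=i].iter().collect();
        // Skip the split when the token before a '.' is a known abbreviation.
        let last_word = so_far
            .trim_end_matches('.')
            .rsplit(char::is_whitespace)
            .next()
            .unwrap_or("");
        if c == '.' && ABBREVIATIONS.contains(&last_word) {
            continue;
        }
        let sentence = so_far.trim();
        if !sentence.is_empty() {
            sentences.push(sentence.to_string());
        }
        start = i + 1;
    }
    let tail: String = chars[start..].iter().collect();
    if !tail.trim().is_empty() {
        sentences.push(tail.trim().to_string());
    }
    sentences
}
```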

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Thread channel_system_prompt through the call chain from bridge dispatch
to PromptContext, avoiding global state and the associated race condition.
Remove system_prompt_replace (append-only behavior).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… UI voice mode

Extends the voice WebSocket adapter (PR #798) with a server-side audio
pipeline so thin clients (browser, mobile) can stream raw PCM and get
audio back without bundling STT/TTS SDKs.

## What's new

**PCM binary mode** — detected automatically by binary WebSocket frame;
text mode (PR #798 protocol) is unchanged and fully backward compatible.

**Smart Turn detection** — ONNX model (`ort` crate) with Whisper-compatible
mel spectrogram (`rustfft`). Detects end-of-utterance from prosody rather
than silence. Falls back to silence timeout if model not configured.

**STT providers** — Deepgram (nova-3) and OpenAI Whisper batch REST.
Accepts raw Int16 PCM, returns transcription string.

**TTS providers** — Cartesia (sonic-2 PCM streaming), ElevenLabs
(pcm_16000 output), OpenAI TTS (pcm, resampled 24→16 kHz).
Returns Int16 PCM at 16 kHz mono.

**Config** — all providers optional under `[channels.voice.stt]`,
`[channels.voice.tts]`, `[channels.voice.smart_turn]`.
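A hypothetical shape for this config, for orientation only: the three section names come from the commit message, and `api_key_env` matches the later cleanup commit; all other field names and values are assumptions.

```toml
# Illustrative only — field names other than the section headers
# and api_key_env are assumptions, not copied from the diff.
[channels.voice.stt]
provider = "deepgram"
model = "nova-3"
api_key_env = "DEEPGRAM_API_KEY"

[channels.voice.tts]
provider = "cartesia"
model = "sonic-2"
api_key_env = "CARTESIA_API_KEY"

[channels.voice.smart_turn]
model_path = "/models/smart-turn-v3.2.onnx"
```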

**Web UI voice mode** — toggle button in chat toolbar activates PCM
WebSocket connection; AudioWorklet (ScriptProcessor fallback) captures
mic, AudioContext plays TTS responses. Status bar shows live state.

**Channel registry** — voice adapter added to the channels setup UI
with all provider fields visible in the configure dialog.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ort's load-dynamic feature panics (not returns Err) when the ONNX
Runtime shared library cannot be found. This panic propagated out of
SmartTurnDetector::load() through with_pipeline() and crashed the API
server before it could bind to its port.

Wrap Session::builder() in catch_unwind(AssertUnwindSafe(...)) so that
a missing libonnxruntime.so is treated as a graceful load failure: the
voice adapter starts without Smart Turn (using silence detection instead)
and logs a warning rather than crashing.
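The graceful-degradation pattern described above can be sketched with std only; `try_load` and its closure argument are stand-ins for the real `SmartTurnDetector::load()` and ort's `Session::builder()` chain.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Sketch of the catch_unwind wrapper: if the loader panics (as ort's
/// load-dynamic feature does when libonnxruntime.so is missing), treat it
/// as a graceful load failure instead of crashing the server.
fn try_load<T>(load_model: impl FnOnce() -> T) -> Option<T> {
    match catch_unwind(AssertUnwindSafe(load_model)) {
        Ok(model) => Some(model),
        Err(_) => {
            // In the real adapter this logs a warning and the session
            // falls back to silence-based end-of-turn detection.
            eprintln!("smart turn model unavailable, using silence detection");
            None
        }
    }
}
```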

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SmartTurnDetector::load() blocks the Tokio async executor when called
on a worker thread. Move it to spawn_blocking so it runs on a dedicated
blocking thread. Wrap with a 15s timeout to fall back to silence
detection if the ONNX model load hangs (observed with ORT 1.20.1 and
the smart-turn-v3.2.onnx quantized model — likely a custom domain
incompatibility).

Also install libonnxruntime.so 1.20.1 in the Docker image and change
with_pipeline() to accept an already-loaded Option<SmartTurnDetector>
so the blocking load happens in the async caller.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Voice WebSocket (/voice) is now mounted on the main Axum router so it
is accessible through the same port as the REST API. A single
reverse-proxy rule now covers both REST and voice — no separate port
4201 exposure needed.

Changes:
- VoiceAdapter pre-creates its bridge channel in new(); exposes
  make_router() to return an Axum Router<()> with the /voice handler
- channel_bridge: calls make_router() before wrapping in Arc<dyn>,
  returns voice_router as part of the start_channel_bridge() result
- server: merges voice_router into the main Axum router after building
- chat.js: voice WebSocket URL now uses same-origin /voice path instead
  of hardcoded port 4201; falls back to window.OPENFANG_VOICE_URL for
  legacy deployments that still need a separate port
- index_body.html: voice conversation button now uses a headphones icon
  to distinguish it clearly from the dictation mic button

The standalone server on listen_addr (4201) still starts for
backwards-compatible direct access.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Move voice call button from compose toolbar to chat header (alongside
  sessions, search, focus mode) — it's a mode switch, not a text input tool
- Replace headphones icon with phone icon to distinguish from dictation mic
- Add Permissions-Policy: microphone=* header to allow getUserMedia on HTTPS
  (Firefox throws SecurityError without explicit permissions policy)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- New ClientMessage::Hello { speaker } — client sends identity on session
  open before PCM streams; sets primary_speaker for the session
- UI sends Hello with sessionUser (or "User" if unauthenticated) on WebSocket
  open, guaranteed before first PCM frame
- All utterances prefixed with [From: Name] in both PCM and text modes
- PCM session now handles text control frames (Hello, Cancel, End) in the
  main select! loop — previously silently ignored
- Optional diarization via VoiceSttConfig.diarize (Deepgram only):
  speaker 0 maps to primary_speaker, others become "Speaker N"
- TranscriptResult enum replaces plain String return from stt::transcribe()
- OpenAI provider logs a warning and falls back to plain if diarize=true
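The speaker-naming rule described above reduces to a small mapping; the function names here are illustrative, not the actual helpers in voice.rs.

```rust
/// Sketch of diarized speaker naming: Deepgram's speaker index 0 maps to
/// the session's primary speaker (set by Hello), any other index becomes
/// a generic "Speaker N" label.
fn speaker_label(index: u32, primary_speaker: &str) -> String {
    if index == 0 {
        primary_speaker.to_string()
    } else {
        format!("Speaker {index}")
    }
}

/// Utterances are prefixed with the [From: Name] convention from the PR.
fn prefix_utterance(index: u32, primary_speaker: &str, text: &str) -> String {
    format!("[From: {}] {}", speaker_label(index, primary_speaker), text)
}
```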

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Two configurable RMS thresholds: barge_in_threshold (normal listening)
  and barge_in_speaking_threshold (elevated while agent speaks, to avoid
  echo triggering barge-in)
- Server: watch::channel cancel signal into TTS task; spoken sentences
  tracked and prepended as context on the follow-up utterance
- Client: worklet emits {pcm, rms}; RMS checked against threshold while
  voiceStatus==='speaking'; _triggerBargeIn() stops playback, clears
  queue, sends {type:'cancel'}; barge_in_ack resets pending flag
- Config frame from server sets _bargeInSpeakingThreshold on connect
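The two-threshold rule can be summarized as a single predicate; the names below are assumptions (the actual check lives in the client worklet handler, not server Rust).

```rust
/// Sketch of the barge-in decision: while the agent is speaking, a higher
/// RMS threshold applies so the agent's own TTS audio, picked up as echo,
/// does not trigger a spurious barge-in.
#[derive(PartialEq)]
enum VoiceStatus {
    Listening,
    Thinking,
    Speaking,
}

fn should_barge_in(rms: f32, status: &VoiceStatus, normal: f32, speaking: f32) -> bool {
    let threshold = if *status == VoiceStatus::Speaking {
        speaking
    } else {
        normal
    };
    rms > threshold
}
```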

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AudioWorklet modules are loaded via URL.createObjectURL(new Blob([...]))
which produces a blob: URL. Without blob: in script-src the browser
blocks the module load and throws SecurityError: Not allowed by CSP,
preventing voice mic capture from starting.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…voiceTranscript Alpine expression

- Add blob: to script-src for AudioWorklet blob URL loading
- Add nonce to service worker inline script
- Fix voiceTranscript x-text: use &quot; entity instead of \" (HTML parser
  was ending the attribute at the unescaped double quote, breaking Alpine)
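After these fixes the relevant response headers might look like the following; the directive lists are illustrative, not copied from the diff, and the nonce is generated per response.

```
Content-Security-Policy: script-src 'self' 'nonce-<generated>' blob:
Permissions-Policy: microphone=*
```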

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tion)

sleep(duration) in a select! loop resets on every iteration — so with the
worklet sending a frame every ~8ms, the 1200ms silence timer could never
fire. The fallback never dispatched any utterances.

Fix: track silence_fire_at as an absolute Instant, pushed forward only
when RMS > SILENCE_SPEECH_THRESHOLD (300 Int16 units). sleep_until fires
at the correct wall-clock time regardless of how many iterations pass.
Added rms_i16() helper for per-chunk amplitude measurement.
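The fix above can be sketched with std types; the constants come from the commit message, while the function shapes are assumptions (the real loop uses `tokio::time::sleep_until(silence_fire_at)` in its `select!`).

```rust
use std::time::{Duration, Instant};

const SILENCE_SPEECH_THRESHOLD: f64 = 300.0; // Int16 amplitude units
const SILENCE_TIMEOUT: Duration = Duration::from_millis(1200);

/// Per-chunk RMS over Int16 samples, as in the rms_i16() helper.
fn rms_i16(samples: &[i16]) -> f64 {
    if samples.is_empty() {
        return 0.0;
    }
    let sum_sq: f64 = samples.iter().map(|&s| (s as f64) * (s as f64)).sum();
    (sum_sq / samples.len() as f64).sqrt()
}

/// Called once per incoming PCM chunk: the deadline is an absolute Instant
/// pushed forward only when speech is detected, so ~8ms frame arrivals no
/// longer reset the timer the way a relative sleep() in select! did.
fn update_silence_deadline(silence_fire_at: &mut Instant, chunk: &[i16], now: Instant) {
    if rms_i16(chunk) > SILENCE_SPEECH_THRESHOLD {
        *silence_fire_at = now + SILENCE_TIMEOUT;
    }
}
```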

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Philippe Branchu and others added 11 commits April 4, 2026 13:37
After rebase onto upstream main, mcp.rs was our hand-rolled version while
kernel.rs expected rmcp's McpTransport::Http variant and headers field.
Restore upstream's rmcp-based mcp.rs. Fix StreamableHttpClientTransportConfig
construction which became #[non_exhaustive] in rmcp 1.3 — use Default then
field mutation instead of struct literal.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mode

The client sends Hello (text) then binary PCM. The server peeked at the
first frame to decide mode — seeing text, it stayed in text mode and
ignored all subsequent binary frames.

Fix: when the first frame is a Hello and a pipeline is configured, read
one more frame; if binary, enter PCM mode with the pre-parsed speaker
name passed directly to handle_pcm_session so identity isn't lost.
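The mode decision can be sketched with a simplified frame enum; this is a toy model (the real adapter works on WebSocket messages and the Hello is JSON, `{"type":"hello","speaker":...}`, not the prefix string used here).

```rust
#[derive(Debug, PartialEq)]
enum Frame {
    Text(String),
    Binary(Vec<u8>),
}

#[derive(Debug, PartialEq)]
enum Mode {
    /// PCM mode, carrying the speaker name pre-parsed from Hello (if any).
    Pcm(Option<String>),
    Text,
}

/// Peek at the first frame; if it is a Hello and a pipeline is configured,
/// read one more frame before committing to a mode, so a Hello-then-binary
/// client lands in PCM mode without losing its identity.
fn decide_mode(first: Frame, read_next: impl FnOnce() -> Frame, pipeline_configured: bool) -> Mode {
    match first {
        Frame::Binary(_) => Mode::Pcm(None),
        Frame::Text(t) => {
            // Toy Hello parsing for illustration only.
            let speaker = t.strip_prefix("hello:").map(str::to_string);
            if speaker.is_some() && pipeline_configured {
                match read_next() {
                    Frame::Binary(_) => Mode::Pcm(speaker),
                    Frame::Text(_) => Mode::Text,
                }
            } else {
                Mode::Text
            }
        }
    }
}
```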

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- far_future sentinel + speech_ever_detected flag prevents silence timer
  from firing before any speech is detected
- Multiple session guard: check WebSocket readyState at start of
  _startVoiceMode, return if OPEN or CONNECTING
- Response text: channel changed to (Vec<i16>, String), sends
  {"type":"response","text":"...","sentence_end":true} JSON frame
  before binary PCM so transcript is populated
- Add Dockerfile.dev + build-dev.sh for fast incremental dev builds

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- voice.rs: capture display_text (clean STT) separately from final_text
  (which has [From: Name] prefix + barge-in context for the agent).
  Send both in transcribed message: {text, display_text}.
- chat.js: use msg.display_text for the chat bubble and voiceTranscript
  status bar preview — shows clean spoken words, not internal prefixes.
- Use messages.concat([...]) instead of push/splice to guarantee
  Alpine.js reactivity for both transcribed and response handlers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the user speaks multiple things before the agent responds, the old
code dispatched each utterance independently and the agent answered them
all sequentially. New behavior:

- First utterance dispatched immediately, waiting_for_response = true
- Any subsequent speech while waiting is buffered in pending_inputs
- When the first response arrives and pending_inputs is non-empty:
  - Discard the stale response (cancel TTS, drain queue)
  - Combine all pending inputs into one request and restart
- On barge-in (user speaks during TTS): clear pending_inputs too

Refactored dispatch_pcm_utterance into:
- transcribe_utterance: STT + client notify (always runs)
- send_utterance_to_agent: bridge send (only when dispatching)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…king

When the user speaks while the agent is mid-inference, we discard the
stale TTS response and send combined pending inputs instead. Now we
prefix the combined prompt with the discarded response text so the model
knows it answered but the user never heard it:

  [You had responded: "..." but the user spoke again before hearing it]
  [From: User] new input 1
  [From: User] new input 2

This prevents the model from building on an answer the user never heard.

Proper fix (cancelling the in-flight agent loop before the response
enters conversation history) tracked in RightNow-AI/openfang#974.
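The combined-prompt construction, matching the format shown above, could be sketched as follows; the function and parameter names are assumptions.

```rust
/// Sketch of building the combined prompt when a stale response is
/// discarded: prepend the discarded text so the model knows it answered
/// but the user never heard it, then append all buffered inputs.
fn combine_pending(discarded_response: Option<&str>, pending_inputs: &[String]) -> String {
    let mut prompt = String::new();
    if let Some(resp) = discarded_response {
        prompt.push_str(&format!(
            "[You had responded: \"{resp}\" but the user spoke again before hearing it]\n"
        ));
    }
    prompt.push_str(&pending_inputs.join("\n"));
    prompt
}
```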

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Consistent with all other OpenFang provider configs. api_key fields on
VoiceSttConfig and VoiceTtsConfig renamed to api_key_env; resolved at
call site via std::env::var. Remove our-specific defaults (port 4201,
smart-turn model path). Generic agent names in tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fetch /api/config on init; expose voice_pcm_enabled flag (true when both
STT and TTS providers are configured). Phone button is always visible but
clicking when voice is not configured shows a system message explaining
how to enable it rather than attempting a failing WebSocket connection.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove unused SmartTurnConfig import
- Remove speech_ever_detected variable (assigned but never read; far_future already gates silence dispatch)
- Replace map(|tx| tx.clone()) with .cloned()
- Replace drain(..).collect() with std::mem::take()
- Add #[allow(dead_code)] on Config server message variant (part of API spec)
- Simplify byte_rate constant (16_000 * 1 * 2 → 16_000 * 2)
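Two of the mechanical rewrites above, side by side, for illustration (function names are made up):

```rust
/// clippy-preferred form of `pending.drain(..).collect()`: take the whole
/// Vec and leave an empty one behind, without reallocating.
fn take_pending(pending: &mut Vec<String>) -> Vec<String> {
    std::mem::take(pending)
}

/// clippy-preferred form of `tx.map(|t| t.clone())`.
fn clone_inner(tx: Option<&String>) -> Option<String> {
    tx.cloned()
}
```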

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@pbranchu pbranchu closed this Apr 4, 2026