
Add PCM voice pipeline: server-side STT/TTS, Smart Turn, web UI voice mode #971

Closed
pbranchu wants to merge 24 commits into RightNow-AI:main from pbranchu:feat/voice-pcm
Conversation


pbranchu commented Apr 3, 2026

⚠️ Dependencies — do not merge until both are on main

This branch was built by merging both feat/voice-adapter and feat/channel-system-prompts, then adding the PCM commits on top. Once #798 and #876 land on main, this branch will be rebased and marked ready for review.

Summary

Builds on the text-based voice adapter (#798) to add a full server-side PCM pipeline:

  • PCM mode: accept raw 16 kHz mono PCM audio from the client over the existing WebSocket; no client-side STT/TTS required
  • STT: Deepgram nova-3 via REST API (stt.rs)
  • TTS: Cartesia sonic-2 with sentence-by-sentence streaming (tts.rs)
  • Smart Turn v3.2: ONNX-based end-of-turn detection with silence-detection fallback if libonnxruntime is unavailable (smart_turn.rs)
  • Voice WebSocket on main port: /voice route merged into the main API router — no separate port required, single reverse proxy covers everything
  • Web UI voice mode: phone icon in chat header to start/end a call; live status bar (listening/thinking/speaking); dictation mic remains in compose toolbar
  • Permissions-Policy header: microphone=* added to webchat response so getUserMedia works on HTTPS

Voice channel system prompt (requires #876)

Once #876 is merged, configure the voice channel system prompt:

```toml
[channels.voice.system_prompt]
prepend = """
You are responding via voice. Keep answers short and conversational.
No bullet points, no markdown, no code blocks, no lists.
User messages may be prefixed with [From: Name] to identify the speaker.
Do NOT include this prefix in your own responses.
"""
```

Test plan

  • `cargo build --workspace` compiles cleanly
  • `cargo clippy --workspace -- -D warnings` passes
  • `cargo fmt --check` passes
  • Connect to `ws://localhost:4200/voice`, send PCM audio, verify STT transcription and TTS response
  • Voice WebSocket accessible at `/voice` on the main API port (no separate port needed)
  • Web UI: phone icon in header starts voice mode; mic in compose toolbar for dictation
  • `getUserMedia` succeeds on HTTPS without `SecurityError`

🤖 Generated with Claude Code

Philippe Branchu and others added 13 commits April 4, 2026 11:27
Introduces a new voice channel adapter that provides a WebSocket server
for voice clients (mobile apps, Meet bots, web clients). Clients handle
STT/TTS directly; this adapter exchanges text with emotion tag parsing
and sentence-by-sentence delivery.

- New VoiceAdapter in openfang-channels with WebSocket protocol
- VoiceConfig in openfang-types with listen address and default agent
- Channel bridge registration and override wiring in openfang-api
- Emotion tag parsing (amused, concerned, formal, warm, etc.)
- Abbreviation-aware sentence splitting for natural TTS delivery
- 10 unit tests for parsing and adapter creation
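The abbreviation-aware splitting mentioned above could look roughly like the following sketch. This is an illustration, not the adapter's actual code: the function name, the abbreviation list, and the splitting rules are assumptions.

```rust
/// Hypothetical sketch of abbreviation-aware sentence splitting for
/// sentence-by-sentence TTS delivery. Splits on '.', '!', '?' followed by
/// whitespace (or end of text), but not after common abbreviations.
fn split_sentences(text: &str) -> Vec<String> {
    const ABBREVIATIONS: &[&str] = &["Dr", "Mr", "Mrs", "Ms", "St", "e.g", "i.e", "etc", "vs"];
    let chars: Vec<char> = text.chars().collect();
    let mut sentences = Vec::new();
    let mut start = 0;
    for (i, &c) in chars.iter().enumerate() {
        let boundary = matches!(c, '.' | '!' | '?')
            && chars.get(i + 1).map_or(true, |n| n.is_whitespace());
        if !boundary {
            continue;
        }
        let so_far: String = chars[start..=i].iter().collect();
        // Skip the split when the token before a '.' is a known abbreviation.
        let last_word = so_far
            .trim_end_matches('.')
            .rsplit(char::is_whitespace)
            .next()
            .unwrap_or("");
        if c == '.' && ABBREVIATIONS.contains(&last_word) {
            continue;
        }
        let sentence = so_far.trim();
        if !sentence.is_empty() {
            sentences.push(sentence.to_string());
        }
        start = i + 1;
    }
    let tail: String = chars[start..].iter().collect();
    if !tail.trim().is_empty() {
        sentences.push(tail.trim().to_string());
    }
    sentences
}
```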

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Thread channel_system_prompt through the call chain from bridge dispatch
to PromptContext, avoiding global state and the associated race condition.
Remove system_prompt_replace (append-only behavior).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… UI voice mode

Extends the voice WebSocket adapter (PR #798) with a server-side audio
pipeline so thin clients (browser, mobile) can stream raw PCM and get
audio back without bundling STT/TTS SDKs.

## What's new

**PCM binary mode** — detected automatically by binary WebSocket frame;
text mode (PR #798 protocol) is unchanged and fully backward compatible.

**Smart Turn detection** — ONNX model (`ort` crate) with Whisper-compatible
mel spectrogram (`rustfft`). Detects end-of-utterance from prosody rather
than silence. Falls back to silence timeout if model not configured.

**STT providers** — Deepgram (nova-3) and OpenAI Whisper batch REST.
Accepts raw Int16 PCM, returns transcription string.

**TTS providers** — Cartesia (sonic-2 PCM streaming), ElevenLabs
(pcm_16000 output), OpenAI TTS (pcm, resampled 24→16 kHz).
Returns Int16 PCM at 16 kHz mono.

**Config** — all providers optional under `[channels.voice.stt]`,
`[channels.voice.tts]`, `[channels.voice.smart_turn]`.
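A hypothetical shape for this config, for orientation only: the three section names come from the commit message, and `api_key_env` matches the later cleanup commit; all other field names and values are assumptions.

```toml
# Illustrative only — field names other than the section headers
# and api_key_env are assumptions, not copied from the diff.
[channels.voice.stt]
provider = "deepgram"
model = "nova-3"
api_key_env = "DEEPGRAM_API_KEY"

[channels.voice.tts]
provider = "cartesia"
model = "sonic-2"
api_key_env = "CARTESIA_API_KEY"

[channels.voice.smart_turn]
model_path = "/models/smart-turn-v3.2.onnx"
```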

**Web UI voice mode** — toggle button in chat toolbar activates PCM
WebSocket connection; AudioWorklet (ScriptProcessor fallback) captures
mic, AudioContext plays TTS responses. Status bar shows live state.

**Channel registry** — voice adapter added to the channels setup UI
with all provider fields visible in the configure dialog.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ort's load-dynamic feature panics (not returns Err) when the ONNX
Runtime shared library cannot be found. This panic propagated out of
SmartTurnDetector::load() through with_pipeline() and crashed the API
server before it could bind to its port.

Wrap Session::builder() in catch_unwind(AssertUnwindSafe(...)) so that
a missing libonnxruntime.so is treated as a graceful load failure: the
voice adapter starts without Smart Turn (using silence detection instead)
and logs a warning rather than crashing.
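The graceful-degradation pattern described above can be sketched with std only; `try_load` and its closure argument are stand-ins for the real `SmartTurnDetector::load()` and ort's `Session::builder()` chain.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Sketch of the catch_unwind wrapper: if the loader panics (as ort's
/// load-dynamic feature does when libonnxruntime.so is missing), treat it
/// as a graceful load failure instead of crashing the server.
fn try_load<T>(load_model: impl FnOnce() -> T) -> Option<T> {
    match catch_unwind(AssertUnwindSafe(load_model)) {
        Ok(model) => Some(model),
        Err(_) => {
            // In the real adapter this logs a warning and the session
            // falls back to silence-based end-of-turn detection.
            eprintln!("smart turn model unavailable, using silence detection");
            None
        }
    }
}
```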

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SmartTurnDetector::load() blocks the Tokio async executor when called
on a worker thread. Move it to spawn_blocking so it runs on a dedicated
blocking thread. Wrap with a 15s timeout to fall back to silence
detection if the ONNX model load hangs (observed with ORT 1.20.1 and
the smart-turn-v3.2.onnx quantized model — likely a custom domain
incompatibility).

Also install libonnxruntime.so 1.20.1 in the Docker image and change
with_pipeline() to accept an already-loaded Option<SmartTurnDetector>
so the blocking load happens in the async caller.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Voice WebSocket (/voice) is now mounted on the main Axum router so it
is accessible through the same port as the REST API. A single
reverse-proxy rule now covers both REST and voice — no separate port
4201 exposure needed.

Changes:
- VoiceAdapter pre-creates its bridge channel in new(); exposes
  make_router() to return an Axum Router<()> with the /voice handler
- channel_bridge: calls make_router() before wrapping in Arc<dyn>,
  returns voice_router as part of the start_channel_bridge() result
- server: merges voice_router into the main Axum router after building
- chat.js: voice WebSocket URL now uses same-origin /voice path instead
  of hardcoded port 4201; falls back to window.OPENFANG_VOICE_URL for
  legacy deployments that still need a separate port
- index_body.html: voice conversation button now uses a headphones icon
  to distinguish it clearly from the dictation mic button

The standalone server on listen_addr (4201) still starts for
backwards-compatible direct access.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Move voice call button from compose toolbar to chat header (alongside
  sessions, search, focus mode) — it's a mode switch, not a text input tool
- Replace headphones icon with phone icon to distinguish from dictation mic
- Add Permissions-Policy: microphone=* header to allow getUserMedia on HTTPS
  (Firefox throws SecurityError without explicit permissions policy)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- New ClientMessage::Hello { speaker } — client sends identity on session
  open before PCM streams; sets primary_speaker for the session
- UI sends Hello with sessionUser (or "User" if unauthenticated) on WebSocket
  open, guaranteed before first PCM frame
- All utterances prefixed with [From: Name] in both PCM and text modes
- PCM session now handles text control frames (Hello, Cancel, End) in the
  main select! loop — previously silently ignored
- Optional diarization via VoiceSttConfig.diarize (Deepgram only):
  speaker 0 maps to primary_speaker, others become "Speaker N"
- TranscriptResult enum replaces plain String return from stt::transcribe()
- OpenAI provider logs a warning and falls back to plain if diarize=true
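The speaker-naming rule described above reduces to a small mapping; the function names here are illustrative, not the actual helpers in voice.rs.

```rust
/// Sketch of diarized speaker naming: Deepgram's speaker index 0 maps to
/// the session's primary speaker (set by Hello), any other index becomes
/// a generic "Speaker N" label.
fn speaker_label(index: u32, primary_speaker: &str) -> String {
    if index == 0 {
        primary_speaker.to_string()
    } else {
        format!("Speaker {index}")
    }
}

/// Utterances are prefixed with the [From: Name] convention from the PR.
fn prefix_utterance(index: u32, primary_speaker: &str, text: &str) -> String {
    format!("[From: {}] {}", speaker_label(index, primary_speaker), text)
}
```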

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Two configurable RMS thresholds: barge_in_threshold (normal listening)
  and barge_in_speaking_threshold (elevated while agent speaks, to avoid
  echo triggering barge-in)
- Server: watch::channel cancel signal into TTS task; spoken sentences
  tracked and prepended as context on the follow-up utterance
- Client: worklet emits {pcm, rms}; RMS checked against threshold while
  voiceStatus==='speaking'; _triggerBargeIn() stops playback, clears
  queue, sends {type:'cancel'}; barge_in_ack resets pending flag
- Config frame from server sets _bargeInSpeakingThreshold on connect
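The two-threshold rule can be summarized as a single predicate; the names below are assumptions (the actual check lives in the client worklet handler, not server Rust).

```rust
/// Sketch of the barge-in decision: while the agent is speaking, a higher
/// RMS threshold applies so the agent's own TTS audio, picked up as echo,
/// does not trigger a spurious barge-in.
#[derive(PartialEq)]
enum VoiceStatus {
    Listening,
    Thinking,
    Speaking,
}

fn should_barge_in(rms: f32, status: &VoiceStatus, normal: f32, speaking: f32) -> bool {
    let threshold = if *status == VoiceStatus::Speaking {
        speaking
    } else {
        normal
    };
    rms > threshold
}
```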

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AudioWorklet modules are loaded via URL.createObjectURL(new Blob([...]))
which produces a blob: URL. Without blob: in script-src the browser
blocks the module load and throws SecurityError: Not allowed by CSP,
preventing voice mic capture from starting.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…voiceTranscript Alpine expression

- Add blob: to script-src for AudioWorklet blob URL loading
- Add nonce to service worker inline script
- Fix voiceTranscript x-text: use &quot; entity instead of \" (HTML parser
  was ending the attribute at the unescaped double quote, breaking Alpine)
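After these fixes the relevant response headers might look like the following; the directive lists are illustrative, not copied from the diff, and the nonce is generated per response.

```
Content-Security-Policy: script-src 'self' 'nonce-<generated>' blob:
Permissions-Policy: microphone=*
```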

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tion)

sleep(duration) in a select! loop resets on every iteration — so with the
worklet sending a frame every ~8ms, the 1200ms silence timer could never
fire. The fallback never dispatched any utterances.

Fix: track silence_fire_at as an absolute Instant, pushed forward only
when RMS > SILENCE_SPEECH_THRESHOLD (300 Int16 units). sleep_until fires
at the correct wall-clock time regardless of how many iterations pass.
Added rms_i16() helper for per-chunk amplitude measurement.
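The fix above can be sketched with std types; the constants come from the commit message, while the function shapes are assumptions (the real loop uses `tokio::time::sleep_until(silence_fire_at)` in its `select!`).

```rust
use std::time::{Duration, Instant};

const SILENCE_SPEECH_THRESHOLD: f64 = 300.0; // Int16 amplitude units
const SILENCE_TIMEOUT: Duration = Duration::from_millis(1200);

/// Per-chunk RMS over Int16 samples, as in the rms_i16() helper.
fn rms_i16(samples: &[i16]) -> f64 {
    if samples.is_empty() {
        return 0.0;
    }
    let sum_sq: f64 = samples.iter().map(|&s| (s as f64) * (s as f64)).sum();
    (sum_sq / samples.len() as f64).sqrt()
}

/// Called once per incoming PCM chunk: the deadline is an absolute Instant
/// pushed forward only when speech is detected, so ~8ms frame arrivals no
/// longer reset the timer the way a relative sleep() in select! did.
fn update_silence_deadline(silence_fire_at: &mut Instant, chunk: &[i16], now: Instant) {
    if rms_i16(chunk) > SILENCE_SPEECH_THRESHOLD {
        *silence_fire_at = now + SILENCE_TIMEOUT;
    }
}
```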

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Philippe Branchu and others added 11 commits April 4, 2026 13:37
After rebase onto upstream main, mcp.rs was our hand-rolled version while
kernel.rs expected rmcp's McpTransport::Http variant and headers field.
Restore upstream's rmcp-based mcp.rs. Fix StreamableHttpClientTransportConfig
construction which became #[non_exhaustive] in rmcp 1.3 — use Default then
field mutation instead of struct literal.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mode

The client sends Hello (text) then binary PCM. The server peeked at the
first frame to decide mode — seeing text, it stayed in text mode and
ignored all subsequent binary frames.

Fix: when the first frame is a Hello and a pipeline is configured, read
one more frame; if binary, enter PCM mode with the pre-parsed speaker
name passed directly to handle_pcm_session so identity isn't lost.
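The mode decision can be sketched with a simplified frame enum; this is a toy model (the real adapter works on WebSocket messages and the Hello is JSON, `{"type":"hello","speaker":...}`, not the prefix string used here).

```rust
#[derive(Debug, PartialEq)]
enum Frame {
    Text(String),
    Binary(Vec<u8>),
}

#[derive(Debug, PartialEq)]
enum Mode {
    /// PCM mode, carrying the speaker name pre-parsed from Hello (if any).
    Pcm(Option<String>),
    Text,
}

/// Peek at the first frame; if it is a Hello and a pipeline is configured,
/// read one more frame before committing to a mode, so a Hello-then-binary
/// client lands in PCM mode without losing its identity.
fn decide_mode(first: Frame, read_next: impl FnOnce() -> Frame, pipeline_configured: bool) -> Mode {
    match first {
        Frame::Binary(_) => Mode::Pcm(None),
        Frame::Text(t) => {
            // Toy Hello parsing for illustration only.
            let speaker = t.strip_prefix("hello:").map(str::to_string);
            if speaker.is_some() && pipeline_configured {
                match read_next() {
                    Frame::Binary(_) => Mode::Pcm(speaker),
                    Frame::Text(_) => Mode::Text,
                }
            } else {
                Mode::Text
            }
        }
    }
}
```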

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- far_future sentinel + speech_ever_detected flag prevents silence timer
  from firing before any speech is detected
- Multiple session guard: check WebSocket readyState at start of
  _startVoiceMode, return if OPEN or CONNECTING
- Response text: channel changed to (Vec<i16>, String), sends
  {"type":"response","text":"...","sentence_end":true} JSON frame
  before binary PCM so transcript is populated
- Add Dockerfile.dev + build-dev.sh for fast incremental dev builds

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- voice.rs: capture display_text (clean STT) separately from final_text
  (which has [From: Name] prefix + barge-in context for the agent).
  Send both in transcribed message: {text, display_text}.
- chat.js: use msg.display_text for the chat bubble and voiceTranscript
  status bar preview — shows clean spoken words, not internal prefixes.
- Use messages.concat([...]) instead of push/splice to guarantee
  Alpine.js reactivity for both transcribed and response handlers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the user speaks multiple things before the agent responds, the old
code dispatched each utterance independently and the agent answered them
all sequentially. New behavior:

- First utterance dispatched immediately, waiting_for_response = true
- Any subsequent speech while waiting is buffered in pending_inputs
- When the first response arrives and pending_inputs is non-empty:
  - Discard the stale response (cancel TTS, drain queue)
  - Combine all pending inputs into one request and restart
- On barge-in (user speaks during TTS): clear pending_inputs too

Refactored dispatch_pcm_utterance into:
- transcribe_utterance: STT + client notify (always runs)
- send_utterance_to_agent: bridge send (only when dispatching)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…king

When the user speaks while the agent is mid-inference, we discard the
stale TTS response and send combined pending inputs instead. Now we
prefix the combined prompt with the discarded response text so the model
knows it answered but the user never heard it:

  [You had responded: "..." but the user spoke again before hearing it]
  [From: User] new input 1
  [From: User] new input 2

This prevents the model from building on an answer the user never heard.

Proper fix (cancelling the in-flight agent loop before the response
enters conversation history) tracked in RightNow-AI/openfang#974.
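The combined-prompt construction, matching the format shown above, could be sketched as follows; the function and parameter names are assumptions.

```rust
/// Sketch of building the combined prompt when a stale response is
/// discarded: prepend the discarded text so the model knows it answered
/// but the user never heard it, then append all buffered inputs.
fn combine_pending(discarded_response: Option<&str>, pending_inputs: &[String]) -> String {
    let mut prompt = String::new();
    if let Some(resp) = discarded_response {
        prompt.push_str(&format!(
            "[You had responded: \"{resp}\" but the user spoke again before hearing it]\n"
        ));
    }
    prompt.push_str(&pending_inputs.join("\n"));
    prompt
}
```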

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Consistent with all other OpenFang provider configs. api_key fields on
VoiceSttConfig and VoiceTtsConfig renamed to api_key_env; resolved at
call site via std::env::var. Remove our-specific defaults (port 4201,
smart-turn model path). Generic agent names in tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fetch /api/config on init; expose voice_pcm_enabled flag (true when both
STT and TTS providers are configured). Phone button is always visible but
clicking when voice is not configured shows a system message explaining
how to enable it rather than attempting a failing WebSocket connection.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove unused SmartTurnConfig import
- Remove speech_ever_detected variable (assigned but never read; far_future already gates silence dispatch)
- Replace map(|tx| tx.clone()) with .cloned()
- Replace drain(..).collect() with std::mem::take()
- Add #[allow(dead_code)] on Config server message variant (part of API spec)
- Simplify byte_rate constant (16_000 * 1 * 2 → 16_000 * 2)
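Two of the mechanical rewrites above, side by side, for illustration (function names are made up):

```rust
/// clippy-preferred form of `pending.drain(..).collect()`: take the whole
/// Vec and leave an empty one behind, without reallocating.
fn take_pending(pending: &mut Vec<String>) -> Vec<String> {
    std::mem::take(pending)
}

/// clippy-preferred form of `tx.map(|t| t.clone())`.
fn clone_inner(tx: Option<&String>) -> Option<String> {
    tx.cloned()
}
```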

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@pbranchu pbranchu closed this Apr 4, 2026