Add PCM voice pipeline: server-side STT/TTS, Smart Turn, web UI voice mode #971

Closed

pbranchu wants to merge 24 commits into RightNow-AI:main from
Conversation
Introduces a new voice channel adapter that provides a WebSocket server for voice clients (mobile apps, Meet bots, web clients). Clients handle STT/TTS directly; this adapter exchanges text with emotion tag parsing and sentence-by-sentence delivery.

- New VoiceAdapter in openfang-channels with WebSocket protocol
- VoiceConfig in openfang-types with listen address and default agent
- Channel bridge registration and override wiring in openfang-api
- Emotion tag parsing (amused, concerned, formal, warm, etc.)
- Abbreviation-aware sentence splitting for natural TTS delivery
- 10 unit tests for parsing and adapter creation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
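The abbreviation-aware sentence splitting mentioned above can be sketched roughly as follows. This is an illustrative minimal version, not the adapter's actual code; the abbreviation list and function name are assumptions.

```rust
// Minimal sketch of abbreviation-aware sentence splitting for TTS delivery.
// The abbreviation list and function name are illustrative, not the real API.
const ABBREVIATIONS: &[&str] = &["Dr", "Mr", "Mrs", "Ms", "e.g", "i.e", "etc", "vs"];

fn split_sentences(text: &str) -> Vec<String> {
    let mut sentences = Vec::new();
    let mut current = String::new();
    let chars: Vec<char> = text.chars().collect();
    for (i, &c) in chars.iter().enumerate() {
        current.push(c);
        if matches!(c, '.' | '!' | '?') {
            // Last word before the terminator, with trailing periods stripped,
            // so "Dr." and "e.g." are recognized as abbreviations.
            let word = current
                .trim_end_matches('.')
                .rsplit(|ch: char| ch.is_whitespace())
                .next()
                .unwrap_or("");
            let is_abbrev = c == '.' && ABBREVIATIONS.contains(&word);
            // Only split at a real boundary: terminator followed by
            // whitespace or end of input.
            let at_boundary = chars.get(i + 1).map_or(true, |n| n.is_whitespace());
            if !is_abbrev && at_boundary {
                sentences.push(current.trim().to_string());
                current.clear();
            }
        }
    }
    if !current.trim().is_empty() {
        sentences.push(current.trim().to_string());
    }
    sentences
}
```

Splitting per sentence (rather than buffering the whole response) lets TTS start speaking the first sentence while the model is still generating the rest.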
Thread channel_system_prompt through the call chain from bridge dispatch to PromptContext, avoiding global state and the associated race condition. Remove system_prompt_replace (append-only behavior). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… UI voice mode

Extends the voice WebSocket adapter (PR #798) with a server-side audio pipeline so thin clients (browser, mobile) can stream raw PCM and get audio back without bundling STT/TTS SDKs.

## What's new

**PCM binary mode** — detected automatically by binary WebSocket frame; text mode (PR #798 protocol) is unchanged and fully backward compatible.

**Smart Turn detection** — ONNX model (`ort` crate) with Whisper-compatible mel spectrogram (`rustfft`). Detects end-of-utterance from prosody rather than silence. Falls back to silence timeout if model not configured.

**STT providers** — Deepgram (nova-3) and OpenAI Whisper batch REST. Accepts raw Int16 PCM, returns transcription string.

**TTS providers** — Cartesia (sonic-2 PCM streaming), ElevenLabs (pcm_16000 output), OpenAI TTS (pcm, resampled 24→16 kHz). Returns Int16 PCM at 16 kHz mono.

**Config** — all providers optional under `[channels.voice.stt]`, `[channels.voice.tts]`, `[channels.voice.smart_turn]`.

**Web UI voice mode** — toggle button in chat toolbar activates PCM WebSocket connection; AudioWorklet (ScriptProcessor fallback) captures mic, AudioContext plays TTS responses. Status bar shows live state.

**Channel registry** — voice adapter added to the channels setup UI with all provider fields visible in the configure dialog.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
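A hedged sketch of what this config could look like: the three section names come from this PR's description, and `api_key_env` and `diarize` are fields named in later commits, but every other key (`provider`, `model`, `model_path`) and all values are assumptions rather than a verified schema.

```toml
# Hypothetical sketch — section names are from the PR; individual keys and
# values are illustrative assumptions, not a documented schema.
[channels.voice.stt]
provider = "deepgram"          # or "openai"
model = "nova-3"
api_key_env = "DEEPGRAM_API_KEY"
diarize = true                 # Deepgram only; OpenAI falls back to plain

[channels.voice.tts]
provider = "cartesia"          # or "elevenlabs", "openai"
model = "sonic-2"
api_key_env = "CARTESIA_API_KEY"

[channels.voice.smart_turn]
model_path = "/models/smart-turn-v3.2.onnx"
```

Since all providers are optional, omitting a section simply disables that stage (e.g. no `smart_turn` section means the silence-timeout fallback is used).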
ort's load-dynamic feature panics (rather than returning Err) when the ONNX Runtime shared library cannot be found. This panic propagated out of SmartTurnDetector::load() through with_pipeline() and crashed the API server before it could bind to its port.

Wrap Session::builder() in catch_unwind(AssertUnwindSafe(...)) so that a missing libonnxruntime.so is treated as a graceful load failure: the voice adapter starts without Smart Turn (using silence detection instead) and logs a warning rather than crashing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
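The graceful-degradation pattern described here can be sketched with a stand-in loader; `load_model` and `try_load` below simulate the real code, which wraps ort's `Session::builder()` and is not reproduced here.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

// Stand-in for the ort session load; panics like ort's load-dynamic feature
// does when libonnxruntime.so is missing. Not the real ort API.
fn load_model(path: &str) -> usize {
    if path.is_empty() {
        panic!("libonnxruntime.so not found"); // simulates the ort panic
    }
    path.len() // stand-in for a loaded session handle
}

// Treat a panic in the loader as a recoverable load failure: return None so
// the caller can fall back to silence detection instead of crashing the server.
fn try_load(path: &str) -> Option<usize> {
    match catch_unwind(AssertUnwindSafe(|| load_model(path))) {
        Ok(session) => Some(session),
        Err(_) => {
            eprintln!("warning: ONNX Runtime unavailable, falling back to silence detection");
            None
        }
    }
}
```

`AssertUnwindSafe` is needed because the closure captures references that the compiler cannot prove unwind-safe; it is a reasonable assertion here since a failed load leaves no shared state behind.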
SmartTurnDetector::load() blocks the Tokio async executor when called on a worker thread. Move it to spawn_blocking so it runs on a dedicated blocking thread. Wrap with a 15s timeout to fall back to silence detection if the ONNX model load hangs (observed with ORT 1.20.1 and the smart-turn-v3.2.onnx quantized model — likely a custom domain incompatibility).

Also install libonnxruntime.so 1.20.1 in the Docker image and change with_pipeline() to accept an already-loaded Option<SmartTurnDetector> so the blocking load happens in the async caller.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Voice WebSocket (/voice) is now mounted on the main Axum router so it is accessible through the same port as the REST API. A single reverse-proxy rule now covers both REST and voice — no separate port 4201 exposure needed.

Changes:
- VoiceAdapter pre-creates its bridge channel in new(); exposes make_router() to return an Axum Router<()> with the /voice handler
- channel_bridge: calls make_router() before wrapping in Arc<dyn>, returns voice_router as part of the start_channel_bridge() result
- server: merges voice_router into the main Axum router after building
- chat.js: voice WebSocket URL now uses same-origin /voice path instead of hardcoded port 4201; falls back to window.OPENFANG_VOICE_URL for legacy deployments that still need a separate port
- index_body.html: voice conversation button now uses a headphones icon to distinguish it clearly from the dictation mic button

The standalone server on listen_addr (4201) still starts for backwards-compatible direct access.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Move voice call button from compose toolbar to chat header (alongside sessions, search, focus mode) — it's a mode switch, not a text input tool
- Replace headphones icon with phone icon to distinguish from dictation mic
- Add Permissions-Policy: microphone=* header to allow getUserMedia on HTTPS (Firefox throws SecurityError without explicit permissions policy)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- New ClientMessage::Hello { speaker } — client sends identity on session open before PCM streams; sets primary_speaker for the session
- UI sends Hello with sessionUser (or "User" if unauthenticated) on WebSocket open, guaranteed before first PCM frame
- All utterances prefixed with [From: Name] in both PCM and text modes
- PCM session now handles text control frames (Hello, Cancel, End) in the main select! loop — previously silently ignored
- Optional diarization via VoiceSttConfig.diarize (Deepgram only): speaker 0 maps to primary_speaker, others become "Speaker N"
- TranscriptResult enum replaces plain String return from stt::transcribe()
- OpenAI provider logs a warning and falls back to plain if diarize=true

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
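The speaker-mapping rule described here is simple enough to sketch; the function names and signatures below are illustrative, not the adapter's actual code.

```rust
// Sketch of the diarization mapping: Deepgram speaker index 0 becomes the
// session's primary speaker, any other index becomes "Speaker N".
// Names are illustrative stand-ins for the real code.
fn speaker_label(index: u32, primary_speaker: &str) -> String {
    if index == 0 {
        primary_speaker.to_string()
    } else {
        format!("Speaker {index}")
    }
}

// The [From: Name] prefix applied to every utterance before it reaches
// the agent, in both PCM and text modes.
fn prefix_utterance(index: u32, primary_speaker: &str, text: &str) -> String {
    format!("[From: {}] {}", speaker_label(index, primary_speaker), text)
}
```

Mapping index 0 to the Hello-supplied identity means a single-speaker session never shows an anonymous "Speaker 0" label.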
- Two configurable RMS thresholds: barge_in_threshold (normal listening) and barge_in_speaking_threshold (elevated while the agent speaks, to avoid echo triggering barge-in)
- Server: watch::channel cancel signal into the TTS task; spoken sentences tracked and prepended as context on the follow-up utterance
- Client: worklet emits {pcm, rms}; RMS checked against the threshold while voiceStatus === 'speaking'; _triggerBargeIn() stops playback, clears the queue, sends {type:'cancel'}; barge_in_ack resets the pending flag
- Config frame from server sets _bargeInSpeakingThreshold on connect

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AudioWorklet modules are loaded via URL.createObjectURL(new Blob([...])) which produces a blob: URL. Without blob: in script-src the browser blocks the module load and throws SecurityError: Not allowed by CSP, preventing voice mic capture from starting. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…voiceTranscript Alpine expression

- Add blob: to script-src for AudioWorklet blob URL loading
- Add nonce to service worker inline script
- Fix voiceTranscript x-text: use the &quot; entity instead of \" (the HTML parser was ending the attribute at the unescaped double quote, breaking Alpine)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tion)

sleep(duration) in a select! loop resets on every iteration — so with the worklet sending a frame every ~8ms, the 1200ms silence timer could never fire. The fallback never dispatched any utterances.

Fix: track silence_fire_at as an absolute Instant, pushed forward only when RMS > SILENCE_SPEECH_THRESHOLD (300 Int16 units). sleep_until fires at the correct wall-clock time regardless of how many iterations pass. Added rms_i16() helper for per-chunk amplitude measurement.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
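An RMS helper like the rms_i16() mentioned here might look like this; a sketch under the assumption that the threshold comparison happens in f64, not the PR's exact code.

```rust
// Root-mean-square amplitude of an Int16 PCM chunk, compared against the
// 300-unit speech threshold to decide whether to push the silence deadline
// forward. Sketch of the rms_i16() helper described in the commit message.
fn rms_i16(samples: &[i16]) -> f64 {
    if samples.is_empty() {
        return 0.0;
    }
    let sum_sq: f64 = samples.iter().map(|&s| (s as f64) * (s as f64)).sum();
    (sum_sq / samples.len() as f64).sqrt()
}
```

Squaring in f64 avoids overflow (i16::MIN squared exceeds i32 range when summed over a chunk) and keeps the comparison against a fractional threshold exact.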
After rebase onto upstream main, mcp.rs was our hand-rolled version while kernel.rs expected rmcp's McpTransport::Http variant and headers field. Restore upstream's rmcp-based mcp.rs. Fix StreamableHttpClientTransportConfig construction which became #[non_exhaustive] in rmcp 1.3 — use Default then field mutation instead of struct literal. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mode

The client sends Hello (text) then binary PCM. The server peeked at the first frame to decide mode — seeing text, it stayed in text mode and ignored all subsequent binary frames.

Fix: when the first frame is a Hello and a pipeline is configured, read one more frame; if binary, enter PCM mode with the pre-parsed speaker name passed directly to handle_pcm_session so identity isn't lost.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
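The two-frame decision can be sketched as a pure function; `Frame` and `Mode` below are illustrative stand-ins for the adapter's real WebSocket types, and the default "User" speaker mirrors the unauthenticated fallback mentioned earlier.

```rust
// Illustrative stand-ins for the adapter's WebSocket frame and session mode.
enum Frame {
    Text(String),    // a Hello control message carrying the speaker name
    Binary(Vec<u8>), // raw Int16 PCM
}

#[derive(Debug, PartialEq)]
enum Mode {
    Text,
    Pcm { speaker: String },
}

// Sketch of the fixed mode decision: a binary first frame means PCM mode
// immediately; a text Hello means peek one more frame before committing,
// carrying the parsed speaker into PCM mode so identity isn't lost.
fn decide_mode(first: &Frame, second: Option<&Frame>, pipeline_configured: bool) -> Mode {
    match first {
        Frame::Binary(_) => Mode::Pcm { speaker: "User".to_string() },
        Frame::Text(speaker) if pipeline_configured => match second {
            Some(Frame::Binary(_)) => Mode::Pcm { speaker: speaker.clone() },
            _ => Mode::Text,
        },
        Frame::Text(_) => Mode::Text,
    }
}
```

Making the decision a function of the first two frames (rather than only the first) is what lets text-mode clients and Hello-then-PCM clients share one endpoint.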
- far_future sentinel + speech_ever_detected flag prevents the silence timer from firing before any speech is detected
- Multiple-session guard: check WebSocket readyState at the start of _startVoiceMode, return if OPEN or CONNECTING
- Response text: channel changed to (Vec<i16>, String); sends a {"type":"response","text":"...","sentence_end":true} JSON frame before the binary PCM so the transcript is populated
- Add Dockerfile.dev + build-dev.sh for fast incremental dev builds

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- voice.rs: capture display_text (clean STT) separately from final_text (which has the [From: Name] prefix + barge-in context for the agent). Send both in the transcribed message: {text, display_text}.
- chat.js: use msg.display_text for the chat bubble and the voiceTranscript status bar preview — shows the clean spoken words, not internal prefixes.
- Use messages.concat([...]) instead of push/splice to guarantee Alpine.js reactivity for both transcribed and response handlers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the user speaks multiple things before the agent responds, the old code dispatched each utterance independently and the agent answered them all sequentially.

New behavior:
- First utterance dispatched immediately, waiting_for_response = true
- Any subsequent speech while waiting is buffered in pending_inputs
- When the first response arrives and pending_inputs is non-empty:
  - Discard the stale response (cancel TTS, drain queue)
  - Combine all pending inputs into one request and restart
- On barge-in (user speaks during TTS): clear pending_inputs too

Refactored dispatch_pcm_utterance into:
- transcribe_utterance: STT + client notify (always runs)
- send_utterance_to_agent: bridge send (only when dispatching)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…king

When the user speaks while the agent is mid-inference, we discard the stale TTS response and send combined pending inputs instead. Now we prefix the combined prompt with the discarded response text so the model knows it answered but the user never heard it:

[You had responded: "..." but the user spoke again before hearing it]
[From: User] new input 1
[From: User] new input 2

This prevents the model from building on an answer the user never heard. Proper fix (cancelling the in-flight agent loop before the response enters conversation history) tracked in RightNow-AI/openfang#974.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
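The prompt assembly quoted above is mostly string formatting; the sketch below follows the bracket format from the commit message, with function and parameter names that are illustrative rather than the adapter's real identifiers.

```rust
// Sketch of the combined-prompt format from the commit message: a bracketed
// note about the discarded response, then each buffered input with its
// [From: Name] prefix. Names are illustrative stand-ins.
fn combine_pending(discarded_response: &str, speaker: &str, pending: &[&str]) -> String {
    let mut prompt = format!(
        "[You had responded: \"{discarded_response}\" but the user spoke again before hearing it]\n"
    );
    for input in pending {
        prompt.push_str(&format!("[From: {speaker}] {input}\n"));
    }
    prompt
}
```

Carrying the discarded text in-band keeps the workaround entirely inside the prompt, which is why the deeper fix (cancelling the in-flight agent loop) can be deferred to a follow-up issue.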
For consistency with all other OpenFang provider configs, the api_key fields on VoiceSttConfig and VoiceTtsConfig are renamed to api_key_env and resolved at the call site via std::env::var. Deployment-specific defaults (port 4201, smart-turn model path) are removed, and tests use generic agent names.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
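The api_key_env pattern amounts to storing an environment-variable name in config and resolving it at call time; a minimal sketch, with an assumed function name and error type:

```rust
use std::env;

// Sketch of call-site resolution for an api_key_env config field: the config
// holds the *name* of an environment variable, never the secret itself.
// Function name and error type are illustrative.
fn resolve_api_key(api_key_env: &str) -> Result<String, String> {
    env::var(api_key_env)
        .map_err(|_| format!("environment variable {api_key_env} is not set"))
}
```

Keeping only the variable name in config means secrets never land in checked-in config files, and a missing key surfaces as a clear error at the first provider call rather than at startup.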
Fetch /api/config on init; expose a voice_pcm_enabled flag (true when both STT and TTS providers are configured). The phone button is always visible, but clicking it when voice is not configured shows a system message explaining how to enable it rather than attempting a WebSocket connection that would fail.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove unused SmartTurnConfig import
- Remove speech_ever_detected variable (assigned but never read; far_future already gates silence dispatch)
- Replace map(|tx| tx.clone()) with .cloned()
- Replace drain(..).collect() with std::mem::take()
- Add #[allow(dead_code)] on Config server message variant (part of API spec)
- Simplify byte_rate constant (16_000 * 1 * 2 → 16_000 * 2)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This branch was built by merging both feat/voice-adapter and feat/channel-system-prompts, then adding the PCM commits on top. Once #798 and #876 land on main, this will be rebased and marked ready for review.

## Summary
Builds on the text-based voice adapter (#798) to add a full server-side PCM pipeline:
- STT providers (stt.rs)
- TTS providers (tts.rs)
- Smart Turn detection (smart_turn.rs)
- /voice route merged into the main API router — no separate port required, single reverse proxy covers everything
- microphone=* added to the webchat response so getUserMedia works on HTTPS

## Voice channel system prompt (requires #876)
Once #876 is merged, configure the voice channel system prompt:
## Test plan

- cargo build --workspace compiles cleanly
- cargo clippy --workspace -- -D warnings passes
- cargo fmt --check passes
- Connect to ws://localhost:4200/voice, send PCM audio, verify STT transcription and TTS response
- /voice reachable on the main API port (no separate port needed)
- getUserMedia succeeds on HTTPS without SecurityError

🤖 Generated with Claude Code