A conversational agent with memory. You can interrupt it mid-sentence and it stops talking. About 1.9 s to first audio on an RTX 4070 SUPER, llama-server + XTTS streaming.[^1]
Local-first. No remote API keys; the only HTTP endpoints it talks to
are on 127.0.0.1.
- Full-duplex turn-taking with a real barge-in cancellation chain. You speak over the agent, the agent stops mid-word, the streaming TTS gets cancelled and the browser flushes its playback buffer. Documented in the "Runtime invariants" section below.
- Pluggable VAD / STT / LLM / TTS via Python entry points. Adding a provider is one Protocol implementation plus an entry-point line in `pyproject.toml`. No core changes.
- Three-layer memory: rolling buffer (last ~30 seconds), SQLite RAG with embeddings (semantic recall across sessions), curated profile loaded at boot.
- Browser front-end over HTTPS + WebSocket on your LAN. Your phone is the mic and the speaker; no app install.
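The real Protocol definitions live in `convers/contracts/`; as an illustrative sketch only (the class names, the `synth_stream` signature, and the `convers.tts` entry-point group are assumptions, not the project's actual contract), a minimal TTS provider might look like:

```python
from typing import Iterator, Protocol, runtime_checkable


@runtime_checkable
class TTSProvider(Protocol):
    """Hypothetical shape of a TTS contract: text in, PCM chunks out."""

    def synth_stream(self, text: str) -> Iterator[bytes]: ...


class SilenceTTS:
    """Toy provider: one silent 16-bit PCM chunk per sentence."""

    def synth_stream(self, text: str) -> Iterator[bytes]:
        for sentence in text.split("."):
            if sentence.strip():
                yield b"\x00\x00" * 1600  # 100 ms of silence at 16 kHz


# pyproject.toml would then register it under the relevant entry-point
# group (group name hypothetical):
#
# [project.entry-points."convers.tts"]
# silence = "my_pkg.providers:SilenceTTS"

print(isinstance(SilenceTTS(), TTSProvider))  # → True
```

Because the contract is a structural Protocol, the provider needs no import from core; the entry point is the only coupling.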
See SETUP.md for install. Once installed, two terminals:
```sh
# Terminal 1
./scripts/start_llm_server.sh

# Terminal 2
python cli.py
```
Open the https://0.0.0.0:8443/ URL it prints, accept the self-signed
cert, tap the circle, talk.
Two processes, both local:
- `llama-server` on port 8080. Chat LLM. Built from source against `llama.cpp` so it can use `--n-cpu-moe 48`, which keeps the 30B MoE expert tensors in CPU RAM and the GPU under 4 GB. About a 5-minute compile. SETUP.md walks through it.
- Ollama daemon on port 11434. Embedder only, serving `nomic-embed-text` for RAG memory recall. Standard `curl ... | sh` install.
Why two: Ollama does not expose `--n-cpu-moe`, so a 30B-A3B MoE under Ollama hits naive layer-offload and runs at about a third the speed. The chat path needs the flag; the embedder path does not.
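A quick way to confirm both daemons are up before starting the agent, assuming the default ports above and the servers' standard endpoints (`/health` on llama-server, `/api/tags` on Ollama); this helper is not part of the repo:

```python
import urllib.request


def up(url: str) -> bool:
    """Return True if a local endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:  # refused, timed out, DNS, etc.
        return False


# Ports match the defaults described above.
for name, url in [
    ("llama-server", "http://127.0.0.1:8080/health"),
    ("ollama", "http://127.0.0.1:11434/api/tags"),
]:
    print(f"{name}: {'up' if up(url) else 'down'}")
```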
```mermaid
flowchart LR
    M[mic / phone] --> V[VAD]
    V --> S[STT]
    S --> L[LLM]
    L --> T[TTS]
    T --> SP[speaker / phone]
    M -. "energy > threshold" .-> B{barge-in?}
    B -. yes .-> X[cancel chain]
    X --> T
```
Tested on an RTX 4070 SUPER 12 GB. The MoE chat model (Qwen3-30B-A3B-2507) needs `--n-cpu-moe 48` on 12 GB cards. Smaller GPUs can fall back to `qwen3:8b` via Ollama, at about 3 to 5 s to first audio instead of 1.9. Apple Silicon: untested. Windows audio I/O: works only via WSL2.
Memory is split into layers because each one answers a different question:
| Layer | Where | Answers |
|---|---|---|
| Profile | `voice_agent_data/profile.json`, prepended to system prompt at boot | "Who is this person?" |
| Semantic RAG | `SQLiteRAGMemoryStore.search` | "Have they said anything related to what they just said?" |
| Rolling | `ConversationMemory`, in-process, ~2000 chars | "What did we say in the last thirty seconds?" |
Profile is curated and always loaded. RAG covers session-spanning recall with a similarity floor that keeps unrelated turns out of the prompt. Rolling carries immediate continuity. Together they cover roughly thirty times the conversation history of "dump everything into the prompt" using about a quarter of the tokens.
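A toy sketch of how the three layers might compose into one prompt (function name and section markers are hypothetical; the real assembly lives in the orchestrator's prompt mixin):

```python
def build_prompt(profile: str, recalled: list[str], rolling: str, user: str) -> str:
    """Toy assembly of the three memory layers.

    - profile: curated facts, always present
    - recalled: RAG hits above the similarity floor (may be empty)
    - rolling: the last ~2000 chars of the live conversation
    """
    parts = [f"[profile]\n{profile}"]
    if recalled:  # framed as background, never as dialogue to imitate
        parts.append("[background, do not restate]\n" + "\n".join(recalled))
    parts.append(f"[recent]\n{rolling}")
    parts.append(f"[user]\n{user}")
    return "\n\n".join(parts)


prompt = build_prompt(
    "Name: Sam. Likes hiking.",
    [],  # nothing above the similarity floor this turn
    "User: hi\nAgent: hey",
    "any trail ideas?",
)
print("[background" in prompt)  # → False: empty recall adds zero tokens
```

The point of the structure: an empty recall list costs nothing, and the similarity floor decides what enters `recalled`, so irrelevant history never reaches the prompt at all.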
There is also a four-layer persona (core + project-specific overlays + session journal + relationship memo) that the agent updates after each session if reflection is enabled. Off by default.
```
[phone browser] -- WebSocket --> [server]
        |
        v
 _BrowserAudioBuffer            ^
 |- _in_queue  (mic chunks)     |
 |- _out_queue (TTS chunks) ----+
        |
        v
 _run_streaming_loop   <-- main thread
 (polls record_chunk, detects barge-in)
        |
        | on a finalized utterance
        v
 _handle_utterance --> _process_text
        |
        |-- streaming branch --> spawn worker thread
        |                            |
        |                            v
        |                  _stream_response (worker)
        |                  |- producer (asyncio loop)
        |                  |    LLM.stream -> SentenceBuffer -> TTS.synth_stream
        |                  |- consumer (this worker)
        |                       chunk_queue -> audio_buffer.play_audio
        |
        |-- non-streaming branch --> tts_executor.submit(_speak_via_rust)
```
Three threads matter: the main thread (listen loop, owns mic capture and barge-in detection), a worker thread spawned per streaming response (consumer of the chunk queue), and uvicorn's asyncio loop (hosts the WebSocket handler, the LLM stream coroutine, and the TTS stream coroutine). The main thread polls so that user speech can interrupt the worker's playback at any time.
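The main thread's poll-while-speaking behaviour can be sketched as follows (a simplified model, not the repo's code; the method names echo the ones above, but the internals here are hypothetical):

```python
import queue


class ListenLoop:
    """Toy model of the main thread's barge-in handling."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold
        self._tts_playing = False
        self.chunk_queue: "queue.Queue[bytes]" = queue.Queue()

    def poll(self, energy: float) -> None:
        # Runs on every mic chunk, even while the worker streams TTS.
        if self._tts_playing and energy > self.threshold:
            self._finish_tts_playback("interrupt")

    def _finish_tts_playback(self, reason: str) -> None:
        # Clear the flag first so the consumer's guard starts dropping
        # chunks, then flush whatever the producer already queued.
        self._tts_playing = False
        while not self.chunk_queue.empty():
            self.chunk_queue.get_nowait()


loop = ListenLoop()
loop._tts_playing = True
loop.chunk_queue.put(b"\x00" * 320)
loop.poll(energy=0.9)  # user spoke over the agent
print(loop._tts_playing, loop.chunk_queue.qsize())  # → False 0
```

Because `poll` runs on the main thread and only flips a flag and drains a queue, the worker never has to be interrupted forcibly; it simply finds nothing left to play.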
A few non-obvious things are load-bearing. Each is one edit away from a regression that would take a debugging session to find:
- Listen loop must keep polling during streaming TTS. That is what makes barge-in possible. The browser audio buffer's `play_audio_thread_safe = True` is the gate; if you swap in a non-thread-safe buffer, set it back to `False` and accept that barge-in degrades.
- The barge-in cancel chain. When user energy crosses the threshold, `_finish_tts_playback("interrupt")` calls `_cancel_streaming_if_active`, sets `_tts_playing = False`, and stops the audio buffer; the chunk-pump consumer drops anything still in flight from the producer. Removing the `if not self._tts_playing: continue` guard in the consumer makes the agent keep talking after a barge-in.
- `_BROWSER_HEAD_START_S` must equal `FRESH_HEAD_MS` in `static/index.html`. The browser schedules the first chunk of every new utterance 500 ms in the future to absorb jitter; the server's playback-end estimate has to match or the wait loop exits 500 ms early and the last words get cut.
- Embeddings are computed at write time, never lazily. L2-normalized for cosine-as-dot-product. If the embedder is unreachable, rows are written with a NULL embedding and a counter ticks; the conversation never blocks on the embedder.
- Recall block is framed as "background, do not restate". Earlier versions used role-tagged exemplars and triggered echo amplification on small models (the LLM copied its own past greetings verbatim). Do not revert.
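The write-time embedding rule can be sketched like this (a minimal illustration, not the store's code; the function name and failure handling are assumptions beyond what the invariant states):

```python
import math


def embed_or_none(text: str, embedder) -> "list[float] | None":
    """Write-time embedding: L2-normalize on success, NULL on failure.

    `embedder` is any callable returning a raw vector. A real store
    would also tick a failure counter; the key property is that the
    conversation turn never blocks on an unreachable embedder.
    """
    try:
        vec = embedder(text)
    except OSError:  # embedder endpoint unreachable
        return None  # row is written with a NULL embedding
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # cosine becomes a plain dot product


print(embed_or_none("hello", lambda t: [3.0, 4.0]))  # → [0.6, 0.8]
```

Normalizing at write time means `search` can rank by a single dot product per row, with no renormalization pass at query time.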
The deeper list, including the `_out_queue` drop-newest semantics, the RAG similarity floor, the `_update_tts_state` early-return during streaming, and the rest, lives in code comments next to the invariants themselves.
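For the drop-newest item specifically, a hypothetical simplification of the semantics (the helper name and the rationale in the comment are mine, not the repo's):

```python
import queue


def put_drop_newest(q: "queue.Queue[bytes]", chunk: bytes) -> bool:
    """Offer a chunk; if the queue is full, drop this (newest) chunk.

    One plausible rationale: a gap at the tail of an utterance is less
    audible than evicting a chunk already scheduled for playback.
    """
    try:
        q.put_nowait(chunk)
        return True
    except queue.Full:
        return False  # caller may tick a drop counter


q: "queue.Queue[bytes]" = queue.Queue(maxsize=2)
print([put_drop_newest(q, bytes([i])) for i in range(3)])  # → [True, True, False]
```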
- `cli.py` is the CLI entrypoint.
- `convers/agent.py` holds `HybridVoiceAgent`, state-only; behaviour lives in seven mixins under `convers/orchestrator/` (`lifecycle`, `listen_loop`, `streaming`, `tts_state`, `prompt`, `persona`, `reflections`).
- `convers/services/` has pluggable provider implementations under `vad/`, `stt/`, `tts/`, `llm/`, `transport/`, `memory/`, `recorder/`, `graph/`, `vision/`, `emotion/`, `events/`, `speaker/`, `tools/`.
- `convers/contracts/` has the Protocol definitions every provider satisfies, one file per concern.
- `convers/registry.py` is the per-group provider registry plus the built-in autoload list.
- `src/` is the Rust audio extension (`audio_processor`); `cargo build` compiles it, and `maturin develop --release` produces the Python module.
- `config/` holds model JSON profiles and the persona core.
- `tests/test_smoke.py` is a single-file pytest suite by convention.
The code is Apache-2.0. The default model weights are not all commercial-friendly. Check before deploying:
| Model | License | Commercial? |
|---|---|---|
| Whisper (faster-whisper) | MIT | yes |
| Silero VAD | MIT | yes |
| Piper voices | per voice (mostly MIT) | check per voice |
| XTTS-v2 (Coqui) | CPML | no, non-commercial |
| Coqui TTS code (idiap fork) | MPL-2.0 | yes, with conditions |
| Qwen3-30B-A3B-2507 weights | Apache-2.0 | yes |
The default `config/models.json` profile uses XTTS-v2, so the default configuration is non-commercial. To go commercial, swap TTS to Piper in `config/models.example.json` and check the per-voice license at huggingface.co/rhasspy/piper-voices.
convers writes the following to `voice_agent_data/`:
- `profile.json`, the curated profile loaded at boot.
- `memory.db`, the SQLite RAG store with embeddings.
- Event JSONLs.
- Optional session journals if reflection is enabled (off by default).
- Optional recorder WAVs if the recorder is enabled (off by default).
- The self-signed HTTPS cert that the browser transport regenerates on first launch (and on LAN-IP change).
No data leaves the machine unless you set `SEARXNG_URL` for the optional web-search tool, which talks to a SearXNG instance you control. There are no remote LLM API keys involved in the tested topology.
To wipe everything: `rm -rf voice_agent_data/`.
Apache-2.0. See LICENSE.
[^1]: Reproduce: run `./scripts/bench_llm_endpoint.py --json` and compare with `bench/2026-05-09-rtx4070s.json`. Hardware in the artifact: RTX 4070 SUPER 12 GB, llama-server with `--n-cpu-moe 48`, XTTS-v2 streaming. Your numbers will differ; the JSON is the contract, not the prose.