
convers

A conversational agent with memory. You can interrupt it mid-sentence and it stops talking. About 1.9 s to first audio on an RTX 4070 SUPER, llama-server + XTTS streaming.¹ Local-first. No remote API keys; the only HTTP endpoints it talks to are on 127.0.0.1.

What it does

  • Full-duplex turn-taking with a real barge-in cancellation chain. You speak over the agent, the agent stops mid-word, the streaming TTS gets cancelled and the browser flushes its playback buffer. Documented in the "Runtime invariants" section below.
  • Pluggable VAD / STT / LLM / TTS via Python entry points. Adding a provider is one Protocol implementation plus an entry-point line in pyproject.toml; no core changes (see the sketch after this list).
  • Three-layer memory: rolling buffer (last ~30 seconds), SQLite RAG with embeddings (semantic recall across sessions), curated profile loaded at boot.
  • Browser front-end over HTTPS + WebSocket on your LAN. Your phone is the mic and the speaker; no app install.
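
As a rough illustration of the provider surface, here is a hypothetical TTS plugin. The Protocol shape and the entry-point group name are assumptions; the real definitions live in convers/contracts/ and pyproject.toml.

# my_tts.py -- hypothetical third-party TTS provider (illustrative only)
from typing import AsyncIterator, Protocol

class TTSContract(Protocol):
    # assumed shape; the real Protocol lives in convers/contracts/
    def synth_stream(self, text: str) -> AsyncIterator[bytes]: ...

class EspeakTTS:
    async def synth_stream(self, text: str) -> AsyncIterator[bytes]:
        # yield PCM chunks as they become available
        yield b"\x00" * 320

# pyproject.toml of the plugin package (group name assumed):
# [project.entry-points."convers.tts"]
# espeak = "my_tts:EspeakTTS"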

Run it

See SETUP.md for install. Once installed, two terminals:

# Terminal 1
./scripts/start_llm_server.sh

# Terminal 2
python cli.py

Open the https://0.0.0.0:8443/ URL it prints, accept the self-signed cert, tap the circle, talk.

What's actually running

Two processes, both local:

  • llama-server on port 8080. Chat LLM. Built from source against llama.cpp so it can use --n-cpu-moe 48, which keeps the 30B MoE expert tensors in CPU RAM and the GPU under 4 GB. About a 5-minute compile. SETUP.md walks through it.
  • Ollama daemon on port 11434. Embedder only, serving nomic-embed-text for RAG memory recall. Standard curl ... | sh install.

Why two: Ollama does not expose --n-cpu-moe, so a 30B-A3B MoE under Ollama hits naive layer-offload and runs at about a third the speed. The chat path needs the flag; the embedder path does not.
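
A quick way to confirm both endpoints are answering, using the ports from this README. The request shapes follow llama-server's OpenAI-compatible API and Ollama's embeddings API; the model names are whatever you configured.

import requests

# llama-server: OpenAI-compatible chat completions on 8080
chat = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "ping"}], "max_tokens": 8},
    timeout=120,
)
print(chat.json()["choices"][0]["message"]["content"])

# Ollama: nomic-embed-text embeddings on 11434
emb = requests.post(
    "http://127.0.0.1:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "hello"},
    timeout=120,
)
print(len(emb.json()["embedding"]), "dimensions")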

Pipeline

flowchart LR
    M[mic / phone] --> V[VAD]
    V --> S[STT]
    S --> L[LLM]
    L --> T[TTS]
    T --> SP[speaker / phone]
    M -. "energy > threshold" .-> B{barge-in?}
    B -. yes .-> X[cancel chain]
    X --> T

Hardware reality check

Tested on RTX 4070 SUPER 12 GB. The MoE chat model (Qwen3-30B-A3B-2507) needs --n-cpu-moe 48 on 12 GB cards. Smaller GPUs can fall back to qwen3:8b via Ollama with about 3 to 5 s to first audio instead of 1.9. Apple Silicon: untested. Windows audio I/O: works only via WSL2.

Memory model

Memory is split into layers because each one answers a different question:

| Layer | Where | Answers |
| --- | --- | --- |
| Profile | voice_agent_data/profile.json, prepended to the system prompt at boot | "Who is this person?" |
| Semantic RAG | SQLiteRAGMemoryStore.search | "Have they said anything related to what they just said?" |
| Rolling | ConversationMemory, in-process, ~2000 chars | "What did we say in the last thirty seconds?" |

Profile is curated and always loaded. RAG covers session-spanning recall, with a similarity floor that keeps unrelated turns out of the prompt. Rolling carries immediate continuity. Together they cover roughly thirty times the conversation history of a "dump everything into the prompt" approach, using about a quarter of the tokens.
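
A hypothetical sketch of how the three layers might be stitched into one prompt. The real assembly lives in the prompt mixin under convers/orchestrator/ and will differ in detail; the function and section headers below are illustrative.

def build_system_prompt(profile: str, rag_hits: list[str], rolling: str) -> str:
    parts = [f"About the user:\n{profile}"]                    # curated profile, loaded at boot
    if rag_hits:                                               # only hits above the similarity floor
        parts.append("Background from earlier sessions (do not restate):\n"
                     + "\n".join(f"- {hit}" for hit in rag_hits))
    parts.append(f"Recent conversation:\n{rolling}")           # ~2000-char rolling buffer
    return "\n\n".join(parts)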

There is also a four-layer persona (core + project-specific overlays + session journal + relationship memo) that the agent updates after each session if reflection is enabled. Off by default.

Architecture: the shape of a turn

[phone browser] -- WebSocket --> [server]
                                 |
                                 v
   _BrowserAudioBuffer            ^
   |- _in_queue   (mic chunks)    |
   |- _out_queue  (TTS chunks) ---+
                  |
                  v
   _run_streaming_loop  <-- main thread
   (polls record_chunk, detects barge-in)
                  |
                  | on a finalized utterance
                  v
   _handle_utterance --> _process_text
                          |
                          |-- streaming branch --> spawn worker thread
                          |                         |
                          |                         v
                          |                       _stream_response (worker)
                          |                         |- producer  (asyncio loop)
                          |                         |    LLM.stream -> SentenceBuffer -> TTS.synth_stream
                          |                         |- consumer  (this worker)
                          |                              chunk_queue -> audio_buffer.play_audio
                          |
                          |-- non-streaming branch --> tts_executor.submit(_speak_via_rust)

Three threads matter: the main thread (listen loop, owns mic capture and barge-in detection), a worker thread spawned per streaming response (consumer of the chunk queue), and uvicorn's asyncio loop (hosts the WebSocket handler, the LLM stream coroutine, and the TTS stream coroutine). The main thread polls so that user speech can interrupt the worker's playback at any time.
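
A minimal sketch of that split, assuming the queue-and-coroutine shapes in the diagram. llm.stream, sentences.push, tts.synth_stream and audio_buffer.play_audio are stand-ins for the real objects.

import asyncio
import queue

def stream_response(loop: asyncio.AbstractEventLoop, llm, sentences, tts, audio_buffer):
    chunk_queue: "queue.Queue[bytes | None]" = queue.Queue()

    async def producer() -> None:
        # tokens -> sentences -> audio chunks, pushed onto the thread-safe queue
        async for token in llm.stream():
            for sentence in sentences.push(token):
                async for chunk in tts.synth_stream(sentence):
                    chunk_queue.put(chunk)
        chunk_queue.put(None)                                  # end-of-stream marker

    # schedule the producer on uvicorn's already-running loop, consume on this worker thread
    asyncio.run_coroutine_threadsafe(producer(), loop)
    while (chunk := chunk_queue.get()) is not None:
        audio_buffer.play_audio(chunk)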

Runtime invariants (read before editing the orchestrator)

A few non-obvious things are load-bearing. Each is one edit away from a regression that would take a debugging session to find:

  • Listen loop must keep polling during streaming TTS. That is what makes barge-in possible. The browser audio buffer's play_audio_thread_safe = True is the gate; if you swap in a non-thread-safe buffer, set it back to False and accept that barge-in degrades.
  • The barge-in cancel chain (sketched after this list). When user energy crosses the threshold, _finish_tts_playback("interrupt") calls _cancel_streaming_if_active, sets _tts_playing = False, and stops the audio buffer; the chunk-pump consumer drops anything still in flight from the producer. Removing the if not self._tts_playing: continue guard in the consumer makes the agent keep talking after a barge-in.
  • _BROWSER_HEAD_START_S must equal FRESH_HEAD_MS in static/index.html. The browser schedules the first chunk of every new utterance 500 ms in the future to absorb jitter; the server's playback-end estimate has to match or the wait loop exits 500 ms early and the last words get cut.
  • Embeddings are computed at write time, never lazily (a sketch follows below). L2-normalized for cosine-as-dot-product. If the embedder is unreachable, rows are written with a NULL embedding and a counter ticks; the conversation never blocks on the embedder.
  • Recall block is framed as "background, do not restate". Earlier versions used role-tagged exemplars and triggered echo amplification on small models (the LLM copied its own past greetings verbatim). Do not revert.
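
A compressed sketch of the cancel chain and the consumer guard, written as orchestrator-style methods. The method and attribute names are the ones from the list above; the class itself is illustrative.

class OrchestratorSketch:
    def on_mic_energy(self, energy: float) -> None:
        # main-thread listen loop: speech over active playback triggers the chain
        if energy > self.barge_in_threshold and self._tts_playing:
            self._finish_tts_playback("interrupt")

    def _finish_tts_playback(self, reason: str) -> None:
        self._cancel_streaming_if_active()     # stop the LLM/TTS producer
        self._tts_playing = False              # flips the consumer guard
        self.audio_buffer.stop()               # browser flushes its playback buffer

    def _consume_chunks(self, chunk_queue) -> None:
        while (chunk := chunk_queue.get()) is not None:
            if not self._tts_playing:          # the load-bearing guard
                continue                       # drop anything still in flight
            self.audio_buffer.play_audio(chunk)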

The deeper list, including the _out_queue drop-newest semantics, the RAG similarity floor, the _update_tts_state early-return during streaming, and the rest, lives as code comments where the invariants exist.
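
For the write-time embedding rule, a minimal sketch under assumed table and helper names; the real store is SQLiteRAGMemoryStore and differs in schema and serialization.

import math
import sqlite3
import struct

embed_failures = 0                                             # ticks when the embedder is unreachable

def add_turn(db: sqlite3.Connection, text: str, embed) -> None:
    global embed_failures
    try:
        vec = embed(text)                                      # embed at write time, never lazily
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        blob = struct.pack(f"{len(vec)}f", *(x / norm for x in vec))  # unit length: cosine == dot
    except Exception:
        blob, embed_failures = None, embed_failures + 1        # write NULL, never block the turn
    db.execute("INSERT INTO turns (text, embedding) VALUES (?, ?)", (text, blob))
    db.commit()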

Module layout

  • cli.py is the CLI entrypoint.
  • convers/agent.py holds HybridVoiceAgent, state-only; behaviour lives in seven mixins under convers/orchestrator/ (lifecycle, listen_loop, streaming, tts_state, prompt, persona, reflections).
  • convers/services/ has pluggable provider implementations under vad/, stt/, tts/, llm/, transport/, memory/, recorder/, graph/, vision/, emotion/, events/, speaker/, tools/.
  • convers/contracts/ has the Protocol definitions every provider satisfies, one file per concern.
  • convers/registry.py is the per-group provider registry plus the built-in autoload list (discovery sketched after this list).
  • src/ is the Rust audio extension (audio_processor); cargo build plus maturin develop --release produces the Python module.
  • config/ holds model JSON profiles and the persona core.
  • tests/test_smoke.py is a single-file pytest suite by convention.
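
Provider discovery on the registry side can be pictured like this; the entry-point group name is assumed and the registry shape is simplified.

from importlib.metadata import entry_points

def load_providers(group: str) -> dict[str, type]:
    # map entry-point name -> provider class for one group (Python 3.10+ entry_points API)
    return {ep.name: ep.load() for ep in entry_points(group=group)}

tts_providers = load_providers("convers.tts")   # group name is an assumption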

Models, weights, licenses

The code is Apache-2.0. The default model weights are not all commercial-friendly. Check before deploying:

| Model | License | Commercial? |
| --- | --- | --- |
| Whisper (faster-whisper) | MIT | yes |
| Silero VAD | MIT | yes |
| Piper voices | per voice (mostly MIT) | check per voice |
| XTTS-v2 (Coqui) | CPML | no, non-commercial |
| Coqui TTS code (idiap fork) | MPL-2.0 | yes, with conditions |
| Qwen3-30B-A3B-2507 weights | Apache-2.0 | yes |

The default config/models.json profile uses XTTS-v2, so the default configuration is non-commercial. To go commercial, swap TTS to Piper in config/models.example.json and check the per-voice license at huggingface.co/rhasspy/piper-voices.

Privacy and on-disk data

convers writes the following to voice_agent_data/:

  • profile.json, the curated profile loaded at boot.
  • memory.db, the SQLite RAG store with embeddings.
  • Event JSONLs.
  • Optional session journals if reflection is enabled (off by default).
  • Optional recorder WAVs if the recorder is enabled (off by default).
  • The self-signed HTTPS cert, generated by the browser transport on first launch and regenerated when the LAN IP changes.

No data leaves the machine unless you set SEARXNG_URL for the optional web-search tool, which talks to a SearXNG instance you control. There are no remote LLM API keys involved in the tested topology.

To wipe everything: rm -rf voice_agent_data/.

License

Apache-2.0. See LICENSE.

Footnotes

  1. Reproduce: run ./scripts/bench_llm_endpoint.py --json and compare with bench/2026-05-09-rtx4070s.json. Hardware in the artifact: RTX 4070 SUPER 12 GB, llama-server with --n-cpu-moe 48, XTTS-v2 streaming. Your numbers will differ; the JSON is the contract, not the prose.
