A local AI agent that attends your calls, listens to conversations, and responds in your cloned voice. All AI inference runs locally on Apple Silicon — no cloud APIs for transcription, LLM, or voice synthesis.
Works with any application that outputs audio — Google Meet, Zoom, Teams, Discord, FaceTime, or anything else. Just route the app's audio through BlackHole and the agent picks it up. Google Meet is used as a reference throughout this README, but the agent is app-agnostic.
Disclaimer: This is a fun experimental project built and tested only on Apple Silicon Macs (M4). It's not production-ready, will occasionally say odd things, and the latency is noticeable. Use it to amuse yourself, not to fool your boss.
Google Meet Audio → BlackHole → Voxtral ASR → Intent Classifier → LLM Response → VoiceBox TTS → Speakers
- Audio Capture — Routes Google Meet audio through BlackHole virtual audio device
- Speech-to-Text — voxtral.c (Mistral's Voxtral Realtime 4B) transcribes speech in real-time
- Intent Detection — Local LLM via Ollama decides if the utterance needs a response (greeting, question, sign-off, etc.)
- Response Generation — Same LLM generates a contextual response using your meeting prep notes, handling noisy ASR by inferring intent from context
- Voice Synthesis — VoiceBox clones your voice and speaks the response through your speakers
```
┌──────────────┐     ┌───────────────┐     ┌──────────────┐
│  BlackHole   │────▶│   voxtral.c   │────▶│  Transcript  │
│     16ch     │     │     (ASR)     │     │    Buffer    │
└──────────────┘     └───────────────┘     └──────┬───────┘
                                                  │
     ┌───────────────┐                            │
     │    Meeting    │                            │
     │  Context (.md)│──────────────────────┐     │
     └───────────────┘                      │     │
                                            ▼     ▼
┌──────────────┐     ┌──────────────┐   ┌───────────────┐
│   Speakers   │◀────│   VoiceBox   │◀──│   Ollama LLM  │
│ (audio out)  │     │    (TTS)     │   │    (brain)    │
└──────────────┘     └──────────────┘   └───────────────┘
```
The orchestrator manages a state machine: IDLE → LISTENING → DETECTING → THINKING → SPEAKING → IDLE
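A minimal sketch of that state machine, assuming the happy path shown above; the side transitions (e.g. DETECTING back to LISTENING when no response is needed) are my assumptions, not taken from the project:

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    DETECTING = auto()
    THINKING = auto()
    SPEAKING = auto()

# Assumed legal transitions; anything else indicates a pipeline bug.
TRANSITIONS = {
    State.IDLE: {State.LISTENING},
    State.LISTENING: {State.DETECTING, State.IDLE},
    State.DETECTING: {State.THINKING, State.LISTENING},  # back to LISTENING if no response needed
    State.THINKING: {State.SPEAKING, State.LISTENING},
    State.SPEAKING: {State.IDLE},
}

def advance(current: State, nxt: State) -> State:
    """Validate and perform a transition."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt
```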
The terminal UI shows two live panels:
- Live Transcript — real-time ASR output from the call
- Conversation Log — full dialogue (Colleague/You), pipeline status, and intent decisions
- macOS with Apple Silicon (M4 recommended)
- 24GB+ RAM (enough to run everything locally, but tight; 48-64GB recommended)
- Python 3.11+
- BlackHole 16ch — virtual audio driver
- Ollama — local LLM server
- VoiceBox — voice cloning TTS app
- voxtral.c — compiled from source (see below)
If you're RAM-constrained (24GB), you can offload Ollama to a second Mac on your local network:
- Primary Mac: voxtral.c (ASR) + VoiceBox (TTS) + audio capture
- Secondary Mac: Ollama (LLM)
On the secondary Mac:
```shell
brew install ollama
ollama pull qwen3:8b
OLLAMA_HOST=0.0.0.0 ollama serve
```

Then in your `config.yaml` on the primary Mac:

```yaml
llm:
  base_url: "http://192.168.1.x:11434"  # your secondary Mac's local IP
```

Note: Intermittent "No route to host" errors can occur with remote Ollama over WiFi. The agent has built-in retry logic (3 attempts), but for reliability, running Ollama locally is recommended.
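That retry behavior can be sketched roughly like this. `call_with_retry` and `ask_ollama` are illustrative helpers, not the project's actual functions; the `192.168.1.x` address is the same placeholder as above. Ollama's `/api/generate` endpoint with `"stream": false` returns a JSON object whose `response` field holds the generated text:

```python
import json
import time
import urllib.request

def call_with_retry(fn, attempts=3, backoff=1.0):
    """Retry a flaky call; 'No route to host' over WiFi surfaces as OSError."""
    for i in range(attempts):
        try:
            return fn()
        except OSError:
            if i == attempts - 1:
                raise  # out of attempts, propagate the error
            time.sleep(backoff)

def ask_ollama(prompt, base_url="http://192.168.1.x:11434", model="qwen3:8b"):
    """One blocking request to a (possibly remote) Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        base_url + "/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())["response"]

# Usage: wrap the network call so one WiFi hiccup doesn't kill the turn
# reply = call_with_retry(lambda: ask_ollama("Say hi"))
```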
```shell
# BlackHole virtual audio driver
brew install blackhole-16ch

# Ollama
brew install ollama
ollama pull llama3.1:8b   # or: qwen3:8b, phi4-mini, gemma3:4b

# VoiceBox — download from https://voicebox.sh
# After install, create a voice profile by recording a ~30s sample of your voice

# Python environment
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Build voxtral.c from source:

```shell
mkdir -p vendor
git clone https://github.com/antirez/voxtral.c.git vendor/voxtral.c
cd vendor/voxtral.c
make                 # requires Xcode Command Line Tools
./download_model.sh  # ~2GB model weights

# Verify
./voxtral -d voxtral-model samples/jfk.wav
cd ../..
```

Route call audio through BlackHole:

- Set your macOS system sound output to **BlackHole 16ch**
- In Google Meet, Chrome will use the system default output (BlackHole)
- The agent captures from BlackHole and plays responses through your actual speakers
Important: Chrome latches audio devices at launch. If you change the output device, quit Chrome fully and reopen it.
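To sanity-check the routing before a call, you can enumerate audio devices with the `sounddevice` library the agent's audio layer already uses. `find_device` is a hypothetical helper written for this sketch; `sd.query_devices()` really does return dict-like entries with a `"name"` key:

```python
def find_device(devices, name_substring):
    """Return the index of the first device whose name contains the substring."""
    for idx, dev in enumerate(devices):
        if name_substring.lower() in dev["name"].lower():
            return idx
    return None

# With the sounddevice library installed:
#   import sounddevice as sd
#   idx = find_device(sd.query_devices(), "BlackHole 16ch")
#   if idx is None:
#       raise SystemExit("BlackHole 16ch not found -- is the driver installed?")
```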
```shell
cp config.example.yaml config.yaml
```

Edit `config.yaml`:

- `agent.name` — your name
- `agent.trigger_names` — add common ASR misrecognitions of your name
- `audio.capture_device` — `"BlackHole 16ch"`
- `audio.playback_device` — your speaker device name
- `llm.model` — Ollama model name
- `llm.base_url` — Ollama URL (localhost or remote machine)
- `tts.voice_profile_id` — your VoiceBox voice profile ID
The meeting context file is the most important part — it's what makes the agent give relevant answers instead of generic ones. Think of it as your meeting prep notes that the agent reads before the call.
```shell
cp meetings/example.md meetings/current.md
```

Edit `meetings/current.md` before every call with:
| Section | What to put | Why it matters |
|---|---|---|
| Your Key Context | Facts, status updates, what you shipped, what's in progress | The agent uses this to answer "what's the status?" questions |
| Your Positions | Pre-loaded answers for expected questions (timeline, risks, blockers) | Direct control over what the agent says — "If asked about X, say Y" |
| Communication Style | How you speak (direct, casual, formal) | Keeps responses sounding like you |
| Things to Avoid | Topics, numbers, or commitments the agent should never make up | Prevents hallucination and overpromising |
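A minimal sketch of how prep notes like these could be turned into prompt material. The project's actual parser lives in `src/brain/context.py`; this assumed version just maps `##` headings to their bullet lists:

```python
def parse_meeting_context(markdown_text):
    """Split prep notes into {section_title: [bullet, ...]} for prompt building."""
    sections, current = {}, None
    for line in markdown_text.splitlines():
        line = line.strip()
        if line.startswith("## "):
            # New section; drop a trailing colon like "## Your Positions:"
            current = line[3:].rstrip(":")
            sections[current] = []
        elif line.startswith("- ") and current:
            sections[current].append(line[2:])
    return sections
```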
Example — before a weekly sync:
```markdown
## Your Key Context:
- Finished the API refactor, pushed to staging on Wednesday
- Production is stable, no incidents this week
- Blocked on design review for the dashboard redesign

## Your Positions:
- If asked about timeline: API refactor ships Monday, dashboard depends on design team
- If asked about risks: only risk is the design dependency
- If asked about production: all green, no issues
```

The better your prep, the better the responses. The agent will say "let me get back to you on that" for anything not covered — which is the right thing to do.
```shell
source .venv/bin/activate

# Run with terminal UI (recommended)
python -m src.main

# Run with debug logging (first run)
python -m src.main --debug

# Watch the pipeline in another terminal
tail -f voiceagent.log

# Listen-only mode (transcribe but don't respond)
python -m src.main --listen-only

# No UI mode (prints transcript to stdout)
python -m src.main --no-ui

# Custom meeting context
python -m src.main --meeting meetings/my-standup.md
```

The agent runs a live Rich terminal dashboard with:
```
┌─ Voice Agent ───────────────────────────────────┐
│ Status: LISTENING  [M]ute [F]orce [S]kip [Q]uit │
├─────────────────────────────────────────────────┤
│ Live Transcript (ASR)                           │
│ Hey good morning, how's the project going?      │
├─────────────────────────────────────────────────┤
│ Conversation Log                                │
│ >> All systems ready — waiting for colleague    │
│ Colleague: Hey good morning, how's the          │
│   project going?                                │
│ >> Silence detected, analyzing intent...        │
│ >> Classifying intent via LLM...                │
│ >> Responding to: greeting + project status     │
│ >> Generating response via LLM...               │
│ >> Synthesizing voice...                        │
│ You: Good morning! The API refactor shipped     │
│   to staging Wednesday, all green so far.       │
│ >> Waiting for colleague to speak...            │
├─────────────────────────────────────────────────┤
│ Meeting: Weekly Sync | Latency: Intent 2.1s     │
│   LLM 3.4s  TTS 8.2s                            │
└─────────────────────────────────────────────────┘
```
| Key | Action |
|---|---|
| `M` | Toggle mute (still transcribes but won't respond) |
| `F` | Force respond to the last utterance |
| `S` | Skip/stop current response playback |
| `Q` | Quit |
```
src/
├── main.py              # CLI entry point
├── orchestrator.py      # Central state machine + pipeline
├── audio/
│   ├── capture.py       # BlackHole audio capture (48kHz→16kHz resampling)
│   ├── devices.py       # Audio device discovery
│   └── playback.py      # Speaker output
├── asr/
│   └── voxtral.py       # voxtral.c subprocess wrapper (stdin/stdout)
├── brain/
│   ├── context.py       # Meeting markdown parser
│   ├── intent.py        # Intent classifier with retry logic
│   ├── gate.py          # Confidence threshold gate
│   └── responder.py     # Response generator with retry logic
├── transcript/
│   └── buffer.py        # Rolling transcript buffer
├── voice/
│   └── tts.py           # VoiceBox TTS client
└── ui/
    └── terminal.py      # Rich terminal dashboard
```
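The `capture.py` comment above mentions 48kHz→16kHz resampling. Since 48000/16000 is exactly 3, a deliberately naive sketch can average each group of three samples; a production path would use a proper anti-aliased resampler such as `scipy.signal.resample_poly(x, 1, 3)`:

```python
def resample_48k_to_16k(samples):
    """Naive 48 kHz -> 16 kHz: average each non-overlapping group of 3 samples.

    Illustrates the 3:1 ratio only; averaging is a crude low-pass filter,
    not what a real capture path should ship with.
    """
    n = len(samples) - len(samples) % 3  # drop a trailing partial group
    return [(samples[i] + samples[i + 1] + samples[i + 2]) / 3
            for i in range(0, n, 3)]
```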
```shell
# Test BlackHole audio loopback
python scripts/test_blackhole.py

# Test audio capture from BlackHole
python scripts/test_audio_pipeline.py

# Test live transcription
python scripts/test_live_transcription.py

# Test intent classification and response generation
python scripts/test_brain.py
```

- Latency — End-to-end response takes 15-30s on 24GB machines (intent ~2-5s + response ~3-5s + TTS ~8-15s). More RAM and a dedicated GPU machine help significantly.
- ASR quality — The small (4B) model under memory pressure misses words, especially at low audio levels. The LLM prompts are tuned to infer meaning from noisy transcriptions using meeting context.
- ASR restarts — On memory-constrained machines, voxtral.c stops during TTS to free GPU memory, causing a few seconds of deaf time after each response.
- Single speaker — Optimized for 1-on-1 calls. Group calls would need diarization (not yet implemented).
- English only — ASR and TTS are configured for English.
- macOS only — Depends on BlackHole, Metal GPU acceleration, and macOS audio APIs.
| Component | Technology | License |
|---|---|---|
| ASR | voxtral.c (Voxtral Realtime 4B) | Apache 2.0 |
| LLM | Ollama + Llama 3.1 8B / Qwen3 8B | Various |
| TTS | VoiceBox (Qwen3-TTS) | MIT |
| Audio | BlackHole 16ch + sounddevice | MIT |
| UI | Rich terminal dashboard | MIT |
| Language | Python 3.11+ with asyncio | — |
MIT