A fast, lightweight voice transcription tool supporting multiple state-of-the-art ASR models. Press a hotkey to record, and your speech is instantly transcribed and pasted where you need it.
## Features

- One-key recording: Hold Ctrl+Space to record, release to transcribe (push-to-talk)
- Multiple ASR backends: Parakeet, Moonshine, and SenseVoice support
- Multilingual: Support for multiple languages depending on model choice
- Smart auto-paste: Automatically detects terminals to pick the correct paste chord (Ctrl+Shift+V vs Ctrl+V)
- Optional LLM post-processing: Clean up transcriptions with a local ONNX model or a cloud API (Anthropic, OpenAI, Gemini)
- History tracking: Last 100 transcripts saved to `history.json` (no audio files)
- Fast daemon mode: Pre-loaded model for instant transcription
- Near-zero paste latency: A persistent uinput device eliminates the per-paste Wayland registration delay
- Privacy first: Runs 100% locally by default; the cloud LLM is opt-in
- Wayland native: Tested on Hyprland; works on Sway and other compositors
- On-demand model downloads: Models are fetched from HuggingFace on first use
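The terminal-aware auto-paste above hinges on one decision: terminals usually reserve Ctrl+V, so pasting there needs Ctrl+Shift+V. A minimal sketch of that decision, with a hypothetical list of terminal app IDs (vox's actual detection logic may differ):

```python
# Hypothetical sketch: choose the paste chord from the focused window's app id.
# The set of terminal identifiers below is illustrative, not vox's real list.
TERMINAL_APP_IDS = {"alacritty", "kitty", "foot", "org.wezfurlong.wezterm"}

def paste_chord(app_id: str) -> str:
    """Terminals typically bind Ctrl+V to something else, so use Ctrl+Shift+V there."""
    if app_id.lower() in TERMINAL_APP_IDS:
        return "ctrl+shift+v"
    return "ctrl+v"

print(paste_chord("kitty"))    # ctrl+shift+v
print(paste_chord("firefox"))  # ctrl+v
```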
## Requirements

- Rust 1.70 or later
- Microphone access
## Installation

Fedora / RHEL / CentOS:

```bash
sudo dnf install -y \
  gtk3-devel \
  gdk-pixbuf2-devel \
  cairo-gobject-devel \
  glib2-devel \
  pkgconf-pkg-config
```

Ubuntu / Debian:

```bash
sudo apt install -y \
  libgtk-3-dev \
  libgdk-pixbuf-2.0-dev \
  libcairo-gobject2 \
  libglib2.0-dev \
  pkg-config
```

Then clone and run the installer:

```bash
git clone https://github.com/yourusername/vox
cd vox
chmod +x install_linux.sh
./install_linux.sh
```

The installer will:

- Build the daemon and client binaries
- Install them to `~/.local/bin/`
- Optionally set up a systemd service
- Configure compositor-specific hotkeys (Hyprland/Sway)
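If you opt into the systemd service, a user unit for the daemon would look roughly like this. This is an illustrative sketch; the unit name, paths, and targets the installer actually writes may differ:

```
# ~/.config/systemd/user/vox-daemon.service (hypothetical example)
[Unit]
Description=Vox voice transcription daemon
After=graphical-session.target

[Service]
ExecStart=%h/.local/bin/vox-daemon
Restart=on-failure

[Install]
WantedBy=default.target
```

Enable it with `systemctl --user enable --now vox-daemon.service`.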
Enable auto-paste (required for Ctrl+V injection):

```bash
sudo setcap "cap_dac_override+p" $(which vox-daemon)
```

To build and install manually instead of using the script:

```bash
cargo build --release --bin vox-daemon --bin vox
cp target/release/vox-daemon ~/.local/bin/
cp target/release/vox ~/.local/bin/
```

## Hotkeys

Bind two keys in your compositor, one to start and one to stop:
Hyprland (`~/.config/hypr/hyprland.conf`):

```
# Basic: transcribe only
bind = CTRL, SPACE, exec, vox start
bindr = CTRL, SPACE, exec, vox stop

# With LLM cleanup (uses the configured LLM)
bind = CTRL SHIFT, SPACE, exec, vox start --llm
bindr = CTRL SHIFT, SPACE, exec, vox stop

# With a custom one-off prompt
bind = SUPER, SPACE, exec, vox start --prompt "Format as bullet points: {text}"
bindr = SUPER, SPACE, exec, vox stop
```
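On Sway, the equivalent of Hyprland's press/release pair is `bindsym` with and without `--release`. A sketch for `~/.config/sway/config` (untested here; adjust modifiers to taste):

```
# Push-to-talk: press starts recording, release transcribes
bindsym Control+space exec vox start
bindsym --release Control+space exec vox stop

# With LLM cleanup
bindsym Control+Shift+space exec vox start --llm
bindsym --release Control+Shift+space exec vox stop
```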
Other useful commands:

```bash
vox toggle           # start if stopped, stop if recording
vox status           # check daemon status
vox quit             # stop daemon
vox model list       # list available models
vox model pull <id>  # download a model
vox model set <id>   # switch active ASR model
```

## Configuration

On first run the daemon writes a template to `~/.config/vox/config.toml` with all options commented out. Edit it to enable what you need:
```toml
# Transcription model to use.
active_model = "parakeet-tdt-0.6b-v3-int8"

# Automatically paste after transcription (default: true).
auto_paste = true

# Number of CPU threads for inference (default: 4).
# intra_threads = 4

# Optional: microphone override.
# input_device = "Built-in Microphone"

# ── LLM post-processing ────────────────────────────────────────────────────
# Run the LLM on every recording by default. Can also be triggered
# per-recording with `vox start --llm` regardless of this setting.
# process_transcription = false

# Prompt template. {text} is replaced with the raw transcript.
# process_prompt = "Clean up the following voice transcription: ..."

# ── Option A: Local ONNX model (private, offline) ──────────────────────────
# llm_model = "qwen3.5-0.8b-fp16"

# ── Option B: Cloud API ────────────────────────────────────────────────────
# Anthropic Claude
# llm_api_provider = "anthropic"
# llm_api_key = "sk-ant-..."
# llm_api_model = "claude-haiku-4-5"

# OpenAI
# llm_api_provider = "openai"
# llm_api_key = "sk-..."
# llm_api_model = "gpt-4o-mini"

# Google Gemini
# llm_api_provider = "gemini"
# llm_api_key = "AIza..."
# llm_api_model = "gemini-2.0-flash"
```

## Models

Models are stored in `~/.config/vox/models/` and downloaded on demand.
**Parakeet** (NVIDIA) — high-accuracy English ASR (default)

| ID | Size | Notes |
|---|---|---|
| `parakeet-tdt-0.6b-v3-int8` | 900 MB | Default, best accuracy/speed |
| `parakeet-tdt-0.6b-v3` | 2.4 GB | FP32 |

**Moonshine** (UsefulSensors) — ultra-fast streaming ASR

| ID | Size | Notes |
|---|---|---|
| `moonshine-tiny` | 110 MB | ~25× faster than real-time |
| `moonshine-base` | 245 MB | Better accuracy |

**SenseVoice** (Alibaba) — multilingual with emotion detection

| ID | Size | Notes |
|---|---|---|
| `sensevoice-small` | 937 MB | Chinese / Japanese / Korean / English |
```bash
vox model list       # see all models + download status
vox model pull <id>  # download
vox model set <id>   # switch (restarts daemon)
```

## LLM Post-Processing

Vox can optionally clean up transcriptions through an LLM. The LLM worker runs in the background and is triggered either per-recording (the `--llm` flag) or automatically for every recording (`process_transcription = true`).

Priority: local ONNX model, then cloud API. If only one is configured, that one is used.
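The `{text}` placeholder in `process_prompt` is presumably filled by plain string substitution of the raw transcript before the prompt is sent to the LLM. A sketch of that mechanic (function name hypothetical, not vox's API):

```python
def render_prompt(template: str, transcript: str) -> str:
    """Fill the {text} placeholder with the raw ASR transcript."""
    return template.replace("{text}", transcript)

prompt = render_prompt(
    "Clean up the following voice transcription: {text}",
    "uh so the meeting is at uh three pm",
)
print(prompt)
# Clean up the following voice transcription: uh so the meeting is at uh three pm
```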
### Option A: Local ONNX model

No API key, no network access, runs on the CPU.

```bash
# Download a model first
vox model pull qwen3.5-0.8b-fp16

# Then set in config.toml:
# llm_model = "qwen3.5-0.8b-fp16"
```

Available LLM models:
| ID | Size | Notes |
|---|---|---|
| `qwen3.5-0.8b-fp16` | ~2 GB | Best quality local option |
| `qwen3.5-0.8b-q4` | ~670 MB | Quantized, faster |
| `qwen3.5-2b-q4` | ~1.5 GB | Larger, more capable |
| `qwen3.5-4b-q4` | ~2.5 GB | Largest local option |
| `gemma-3-270m-it-q4` | ~800 MB | Google Gemma 3 270M |
| `gemma-3-270m-it-fp16` | ~570 MB | Google Gemma 3 270M FP16 |
| `gemma-3-1b-it-q4` | ~1.7 GB | Google Gemma 3 1B |
### Option B: Cloud API

Set `llm_api_key`, and optionally `llm_api_provider` and `llm_api_model`.

| Provider | `llm_api_provider` | Default model | Key format |
|---|---|---|---|
| Anthropic Claude | `anthropic` (default) | `claude-haiku-4-5` | `sk-ant-...` |
| OpenAI | `openai` | `gpt-4o-mini` | `sk-...` |
| Google Gemini | `gemini` | `gemini-2.0-flash` | `AIza...` |
```bash
# Use the LLM for this recording (regardless of the process_transcription setting)
vox start --llm

# Use a custom prompt for this recording
vox start --prompt "Summarise in one sentence: {text}"
```

## Architecture

- Audio: `cpal` for cross-platform audio capture
- ASR inference: ONNX Runtime (`ort`)
- Keyboard injection: `evdev` uinput (Linux) / `enigo` (macOS/Windows)
- Clipboard: `wl-copy` (Wayland) / `arboard` (X11, macOS, Windows)
- LLM inference: local ONNX backends (Qwen3.5, Gemma 3) or cloud REST APIs
- IPC: newline-delimited JSON over a Unix domain socket
- CLI: `clap`
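The client and daemon talk over a Unix domain socket, one JSON object per line. The socket path and command vocabulary are not documented here, so the sketch below only illustrates the newline-delimited framing; the `status` command and path are hypothetical:

```python
import json
import socket

SOCKET_PATH = "/run/user/1000/vox.sock"  # hypothetical; check your install

def frame(message: dict) -> bytes:
    """Encode one request: a single JSON object, newline-terminated."""
    return (json.dumps(message) + "\n").encode()

def parse_reply(line: bytes) -> dict:
    """Decode one reply line back into a dict."""
    return json.loads(line)

# Example round-trip against a running daemon (not executed here):
# with socket.socket(socket.AF_UNIX) as s:
#     s.connect(SOCKET_PATH)
#     s.sendall(frame({"command": "status"}))
#     print(parse_reply(s.makefile("rb").readline()))
```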
## Troubleshooting

**Paste not working:**

```bash
# Grant uinput access
sudo setcap "cap_dac_override+p" $(which vox-daemon)
```

**Model download failed:**

```bash
vox model pull <id>  # retry download
```

**LLM post-processing not running:**

- Local model: check that it is downloaded (`vox model list`) and that `llm_model` is set in the config
- API: check that the API key is correct and that `llm_api_provider` matches the key type

**No audio captured:**

- Check microphone permissions in system settings
- Use `vox device list` to see available devices, then set `input_device` in the config
## License

MIT / Apache-2.0