Curated list of AI tools that run 100% on your machine — no cloud, no telemetry, no "local-ish" setups that secretly phone home.
Maintained by Brethof AI. Companion to awesome-llms-txt and awesome-private-ai.
The phrase "local-first" has been stretched to mean almost anything in 2026. Lots of tools advertise on-device AI but then call out to remote APIs for "complex" prompts, sync your data to a "private" cloud bucket, or ship telemetry that re-introduces every privacy problem the local mode was meant to solve.
This list applies a strict rule: the tool must perform inference on hardware you own and not transmit prompts, embeddings, or audio off that hardware during normal use. Optional cloud features (like model download or update checks) are fine; mandatory ones disqualify.
Where a tool is partially-local (e.g. ships a fully-local mode but defaults to cloud), we say so explicitly. Every link is checked — a CI-style sweep cuts entries whose URL 404s.
To be listed:
- Inference happens on the user's CPU, GPU, NPU, or local accelerator.
- No mandatory account creation just to use the offline mode.
- Source code or signed binaries available — verifiable provenance.
- Maintained: a release, commit, or PR merge in the last 6 months.
- Real artefact (not a "coming soon" landing page).
- 🐧 Linux · 🪟 Windows · 🍎 macOS · 📱 iOS / Android · 🌐 Web (in-browser)
- 🔓 open source · 🔒 closed source
- 🆓 free for personal · 💰 paid · 🆓💰 free + paid tiers
- 🐍 Python · 🦀 Rust · 🐹 Go · ⚙️ C/C++ · 🟦 TypeScript / JS
- Inference Runtimes (13)
- Desktop Chat Apps (4)
- Voice — Speech-to-Text (7)
- Voice — Text-to-Speech (6)
- Image Generation (7)
- Video Generation (2)
- Code Assistants (5)
- Local Agents (3)
- Vector Databases (6)
- Embeddings (3)
- Training & Fine-tuning (5)
- Local Search & RAG (5)
- Operating Systems Tuned for AI (5)
- Hardware-Specific Runtimes (4)
Engines that load LLMs, vision models, and other neural networks for inference on your hardware.
- ExLlamaV2 — 🐧 🪟 🔓 🆓 🐍
Fast inference for quantised LLMs on consumer NVIDIA GPUs. EXL2 format outperforms GGUF on the same hardware in many benchmarks. - GPT4All — 🐧 🪟 🍎 🔓 🆓 ⚙️
Privacy-first desktop chat with curated quantised models. Strong CPU performance. - Jan — 🐧 🪟 🍎 🔓 🆓 🟦
Open-source ChatGPT alternative. Bundles llama.cpp + a clean UI. - KoboldCpp — 🐧 🪟 🍎 🔓 🆓 ⚙️
Single-binary llama.cpp wrapper with KoboldAI UI for chat, story-writing, RP. - llama.cpp — 🐧 🪟 🍎 🔓 ⚙️
Reference C++ implementation for running LLaMA-family and other transformer models with GGUF quantization. Powers most of the others in this section. - LM Studio — 🐧 🪟 🍎 🔒 🆓 ⚙️
Polished desktop app for discovering, downloading, and running local LLMs. OpenAI-compatible server mode. Free for personal + commercial. - LocalAI — 🐧 🪟 🍎 🔓 🆓 🐹
Self-hosted, OpenAI-compatible inference server. Text, image, audio, embeddings — all on your machine. - Mistral.rs — 🐧 🪟 🍎 🔓 🆓 🦀
Rust LLM inference platform with quantization, vision, MoE, and speculative decoding. - MLC LLM — 🐧 🪟 🍎 📱 🌐 🔓 🆓 🐍
Compile-once, deploy-anywhere LLM runtime. Targets WebGPU, Vulkan, CUDA, Metal, iOS, and Android from a single source. - Ollama — 🐧 🪟 🍎 🔓 🆓 🐹
Single-binary server with a built-in model library. Pull, run, and swap models with one command. - SGLang — 🐧 🔓 🆓 🐍
Fast LLM and VLM serving runtime with RadixAttention cache and structured-output support. - Text Generation WebUI — 🐧 🪟 🍎 🔓 🆓 🐍
Gradio-based web UI for local LLMs. Supports GGUF, GPTQ, AWQ, EXL2. - vLLM — 🐧 🔓 🆓 🐍
High-throughput inference engine with PagedAttention. Designed for serving, not desktop chat — pair with Open WebUI or LiteLLM.
GUI applications wrapping a local runtime in a chat interface.
- Anything LLM — 🐧 🪟 🍎 🔓 🆓 🟦
Workspace-style chat with built-in RAG. Works fully offline with a local LLM provider. - Faraday — 🐧 🪟 🍎 🔒 🆓 ⚙️
Local-only character / role-play chat. Bundles inference, no API key needed. - Msty — 🐧 🪟 🍎 🔒 🆓 🟦
Fast desktop chat with branching conversations and parallel-model comparison. Free tier covers personal local use. - Open WebUI — 🐧 🪟 🍎 🌐 🔓 🆓 🐍
Self-hosted "ChatGPT clone" of the open-source world. Pair with Ollama or any OpenAI-compatible local server.
- Brethof Voice Pro — 🐧 🪟 🔒 🆓 💰 ⚙️
Desktop dictation app built on Qwen3-ASR + GGUF + llama.cpp. 36 languages, hotkey-anywhere transcription, file/microphone/system-audio input, LoRA personal voice training. 100% offline, no account required to transcribe. Disclosure: maintained by us. - faster-whisper — 🐧 🪟 🍎 🔓 🆓 🐍
CTranslate2-based reimplementation. ~4× faster than reference Whisper at the same accuracy. - OpenAI Whisper — 🐧 🪟 🍎 🔓 🆓 🐍
Reference Python implementation. Accurate but slower than the C++ ports; useful when you need the exact research behaviour. - RealtimeSTT — 🐧 🪟 🍎 🔓 🆓 🐍
Low-latency streaming wrapper around faster-whisper for live dictation pipelines. - Vosk — 🐧 🪟 🍎 📱 🔓 🆓 🐍
Lightweight offline speech recognizer with 20+ language models. Real-time on CPU. - Whisper.cpp — 🐧 🪟 🍎 📱 🔓 🆓 ⚙️
C++ port of OpenAI Whisper with GGUF quantization. Runs on CPU, Metal, CUDA, Vulkan. - WhisperX — 🐧 🪟 🍎 🔓 🆓 🐍
faster-whisper plus forced alignment, voice-activity detection, and speaker diarization.
- Bark — 🐧 🪟 🍎 🔓 🆓 🐍
Multilingual generative audio. Speech, sound effects, and music cues from text prompts. - Coqui TTS — 🐧 🪟 🍎 🔓 🆓 🐍
Comprehensive TTS toolkit. Multiple architectures (Tacotron, VITS, XTTS) and voice cloning. - Kokoro — 🐧 🪟 🍎 🔓 🆓 🐍
Tiny ~80M-param TTS model, surprisingly natural for the size. Suitable for low-end hardware. - Mimic 3 — 🐧 🪟 🍎 📱 🔓 🆓 🐍
Mycroft's neural TTS engine. Lightweight, multilingual. - Piper — 🐧 🪟 🍎 📱 🔓 🆓 ⚙️
Fast neural TTS. ONNX runtime, dozens of voices and languages. Designed for Raspberry Pi-class hardware. - StyleTTS 2 — 🐧 🪟 🍎 🔓 🆓 🐍
High-fidelity expressive TTS with style transfer. Strong reference voice cloning.
- AUTOMATIC1111 / Stable Diffusion WebUI — 🐧 🪟 🍎 🔓 🆓 🐍
The original ergonomic SD UI. Heavy plugin ecosystem. - ComfyUI — 🐧 🪟 🍎 🔓 🆓 🐍
Node-graph workflow editor for diffusion models. Powers most modern local image and video pipelines. - Fooocus — 🐧 🪟 🍎 🔓 🆓 🐍
Image generator with sane defaults — minimal knobs for great results. Built on top of Stable Diffusion. - Forge — 🐧 🪟 🍎 🔓 🆓 🐍
Performance-tuned A1111 fork by lllyasviel. Lower VRAM, faster on modern GPUs. - InvokeAI — 🐧 🪟 🍎 🔓 🔒 🆓 💰 🐍
Pro-grade SD UI with strong canvas / inpainting tools. Enterprise tier; free local install remains open source. - SD.Next — 🐧 🪟 🍎 🔓 🆓 🐍
All-in-one fork of A1111 with broader backend support (Diffusers, ONNX, ROCm). - SwarmUI — 🐧 🪟 🍎 🔓 🆓 🟦
Modular UI built on top of ComfyUI. User-friendly mode out of the box, full node-graph available when you need it.
- ComfyUI + LTX Video — 🐧 🪟 🍎 🔓 🆓 🐍
ComfyUI nodes drive Lightricks LTX video models for text-to-video and image-to-video generation. The chunked-loop pattern (released in our comfyui-workflows) produces longer outputs than vanilla LTX allows. - Wan2GP — 🐧 🪟 🍎 🔓 🆓 🐍
Stripped-down Wan2.2 video pipeline for low-VRAM consumer GPUs.
- Aider — 🐧 🪟 🍎 🔓 🆓 🐍
Terminal pair-programming. Bring-your-own-LLM via LiteLLM — run with Ollama or any OpenAI-compatible local endpoint. - Continue — 🐧 🪟 🍎 🔓 🆓 🟦
IDE assistant with first-class local-LLM support. Defaults can be set to Ollama / LM Studio. VS Code + JetBrains. - Llama.vim — 🐧 🪟 🍎 🔓 🆓 ⚙️
Vim plugin that streams llama.cpp completions inline. No cloud. - Tabby — 🐧 🪟 🍎 🔓 🆓 🦀
Self-hosted GitHub Copilot alternative. Local model serving with IDE plugins. - twinny — 🐧 🪟 🍎 🔓 🆓 🟦
Free local AI extension for VS Code. Chat + autocomplete via Ollama.
- Aider in /architect mode — 🐧 🪟 🍎 🔓 🆓 🐍
Aider's planning mode separates "decide" and "edit" steps; works well with strong local reasoning models. - Continue Agent mode — 🐧 🪟 🍎 🔓 🆓 🟦
Agentic editing flow inside Continue. Pair with a local model for fully-offline coding agents. - Open Interpreter — 🐧 🪟 🍎 🔓 🆓 🐍
Code-execution agent that runs Python/shell on your machine. Local-LLM friendly.
- Chroma — 🐧 🪟 🍎 🔓 🆓 🐍
Embedding database designed for local-first usage. SQLite-style single-file or client/server. - Faiss — 🐧 🪟 🍎 🔓 🆓 ⚙️
Library for similarity search. The retrieval engine inside many of the others. - LanceDB — 🐧 🪟 🍎 🔓 🆓 🦀
Embedded, columnar vector DB. Single-file, no server. - Marqo — 🐧 🪟 🍎 🔓 🔒 🆓 💰 🐍
End-to-end vector search; OSS core, paid hosted version. - Qdrant — 🐧 🪟 🍎 🔓 🆓 🦀
High-performance vector DB. Self-host the open-source binary. - Weaviate — 🐧 🪟 🍎 🔓 🆓 🐹
Hybrid (vector + keyword) DB. Self-host the OSS distribution; cloud is optional.
- BGE — 🐧 🪟 🍎 🔓 🆓 🐍
BAAI's BGE family. Strong English + multilingual variants. Run via llama.cpp, sentence-transformers, or fastembed. - fastembed — 🐧 🪟 🍎 🔓 🆓 🐍
Lightweight CPU-friendly embedding library by Qdrant. - Sentence Transformers — 🐧 🪟 🍎 🔓 🆓 🐍
Reference Python library for sentence + paragraph embeddings.
- Axolotl — 🐧 🔓 🆓 🐍
Config-driven fine-tuning framework. LoRA, QLoRA, full fine-tunes. - diffusion-pipe — 🐧 🔓 🆓 🐍
Pipeline-parallel trainer for diffusion models. Multi-GPU LoRA on large image / video models. - MLX — 🍎 🔓 🆓 🐍
Apple's native ML framework for Apple Silicon. Train and infer on M-series Macs without CUDA workarounds. - Ostris ai-toolkit — 🐧 🪟 🔓 🆓 🐍
LoRA training UI for Flux, SD3, SDXL, LTX. Works on consumer hardware. - Unsloth — 🐧 🪟 🍎 🔓 🆓 🐍
Fine-tune LLMs 2× faster with 70% less VRAM than reference HuggingFace pipelines.
- Anything LLM — 🐧 🪟 🍎 🔓 🆓 🟦
Self-hosted workspace tool with integrated RAG. Listed twice intentionally — strong both as a chat app and a RAG layer. - LlamaIndex — 🐧 🪟 🍎 🔓 🆓 🐍
Toolkit for building RAG pipelines. Works fully offline with local models + vector DBs. - Perplexica — 🐧 🪟 🍎 🔓 🆓 🟦
Open-source AI search powered by SearXNG + your local LLM. - PrivateGPT — 🐧 🪟 🍎 🔓 🆓 🐍
Ingest documents locally and query them with an offline LLM. - SearXNG — 🐧 🪟 🍎 🔓 🆓 🐍
Self-hosted meta-search engine. Pair with a local LLM for an offline Perplexity-style assistant.
- Bazzite — 🐧 🔓 🆓
Container-native gaming and AI distro. Steam Deck-friendly, latest drivers, easy CUDA. - Bluefin — 🐧 🔓 🆓
Fedora-based, atomic, container-first. Good "drop you in a known state" workstation for AI work. - CachyOS — 🐧 🔓 🆓
Arch-based desktop distro with a tuned kernel and recent NVIDIA / AMD drivers. Sane out-of-the-box for new GPUs (Blackwell, RDNA 4). - NixOS — 🐧 🔓 🆓
Reproducible system config. Best when you need identical CUDA + ML toolchain across machines. - Pop!_OS — 🐧 🔓 🆓
System76's NVIDIA-friendly desktop distro. ISO ships with proprietary drivers for plug-and-play GPU work.
- MLX — 🍎 🔓 🆓 🐍
Apple Silicon-native ML library. Already listed under training; it also ships an inference runtime competitive with llama.cpp on M-series. - NVIDIA TensorRT-LLM — 🐧 🪟 🔒 🆓 🐍
NVIDIA's optimised LLM runtime for their data-center and consumer GPUs. Closed-weights binary; fastest CUDA path for many models. - OpenVINO — 🐧 🪟 🍎 🔓 🆓 ⚙️
Intel's inference toolkit. CPU, iGPU, dGPU (Arc), and NPU support for Intel laptops. - ROCm + llama.cpp HIP — 🐧 🪟 🍎 🔓 ⚙️
AMD GPU inference path. Llama.cpp's HIP backend now reaches CUDA parity on RDNA 3/4 in many workloads.
- awesome-llms-txt — Tools that publish
llms.txtfor agent discovery. - awesome-private-ai — Privacy-first AI more broadly (some on this list, plus privacy-respecting cloud).
- awesome-mcp-servers — MCP servers, many of which sit happily next to a local LLM.
- awesome-ai-minefield — License + ToS analysis for the models you'll run locally.
- awesome-linux-for-ai — Linux distros tuned for the AI workstations these tools live on.
- comfyui-workflows — Curated, working ComfyUI workflows for local image / video generation.
Open an issue with the tool name, repo or homepage URL, the category it
should land in, and one paragraph on why it's worth listing. Entries live
as one YAML file each under entries/; this README is generated from them,
so edit the YAML, not the list above. We do not list tools whose offline
mode is gated behind a paid plan.
MIT.
Maintained by Brethof AI — AI tools built for people who take their data seriously.