AlienWalker1995 · Jul 1, 2026
diff --git a/README.md b/README.md
@@ -88,7 +88,7 @@ All UI ports below are **internal** (container-network). Operators reach them vi
 ./compose up -d
 ```
 
-**CPU-only / minimal services:** bring up a subset after init, e.g. `./compose up -d ollama dashboard open-webui`.
+**CPU-only / minimal services:** bring up a subset after init, e.g. `./compose up -d llamacpp dashboard open-webui`.
 
 ## Installation
 
@@ -158,7 +158,7 @@ Large optional downloads on demand; first run can take a long time. Pull via the
 
 ### GPU / compute
 
-Hardware detection writes **`overrides/compute.yml`**. The `compose` wrapper runs detection before commands. **No GPU:** use a minimal service set (`./compose up -d ollama dashboard open-webui`); ComfyUI will be slower.
+Hardware detection writes **`overrides/compute.yml`**. The `compose` wrapper runs detection before commands. **No GPU:** use a minimal service set (`./compose up -d llamacpp dashboard open-webui`); ComfyUI will be slower.
 
 ### Architecture
 
@@ -171,7 +171,7 @@ Tailnet device → Caddy :443 (TLS) → oauth2-proxy (Google SSO + email allowli
                                           ├── /comfy/    → ComfyUI
                                           └── /hermes/   → Hermes dashboard
                                                   │
-                                                  ├── Model Gateway → LiteLLM → llama.cpp / Ollama / (vLLM)
+                                                  ├── Model Gateway → LiteLLM → llama.cpp
                                                   ├── MCP Gateway → shared tools (SearXNG, n8n, ComfyUI, …)
                                                   └── Ops Controller → Docker Compose lifecycle (token-auth, no host port)
 ```
@@ -180,7 +180,7 @@ Local-first AI; operator-deployed front door. Dashboard does not mount `docker.s
 
 ### Data
 
-Bind mounts only. Set **`BASE_PATH`** (and optionally **`DATA_PATH`**). Ollama blobs under **`models/ollama`**. See [docs/data.md](docs/data.md).
+Bind mounts only. Set **`BASE_PATH`** (and optionally **`DATA_PATH`**). See [docs/data.md](docs/data.md).
 
 ### MCP (Model Context Protocol)
 
@@ -231,7 +231,7 @@ Optional: `DOCTOR_DEPS_TIMEOUT_SEC`; `DASHBOARD_AUTH_TOKEN` from `.env` when pro
 ## Troubleshooting
 
 1. **Services won’t start or images are stale** — Rebuild affected images and recreate, e.g. `docker compose build dashboard model-gateway` (or the `compose` wrapper), then `up -d`. Doctor **WARN** on missing `/api/dependencies` or `/ready` often indicates an old image.
-2. **Doctor warns on Ollama (11434) or MCP (8811)** — Expected if those ports are not published; use `overrides/ollama-expose.yml` / `overrides/mcp-expose.yml` or set `DOCTOR_STRICT=1` only when you intend strict probes (see doctor script comments in repo).
+2. **Doctor warns on MCP (8811)** — Expected if that port is not published; use `overrides/mcp-expose.yml` or set `DOCTOR_STRICT=1` only when you intend strict probes (see doctor script comments in repo).
 3. **No GPU** — Use a minimal service set or CPU-oriented overrides; ComfyUI will be slower.
 4. **Exposing to a network** — Enable **Open WebUI** auth (`WEBUI_AUTH=True`), set `DASHBOARD_AUTH_TOKEN`, and harden **n8n** — see [SECURITY.md](SECURITY.md).
 

diff --git a/SECURITY.md b/SECURITY.md
@@ -71,4 +71,4 @@ All runtime data is stored under `BASE_PATH/data/` via bind mounts. Ensure appro
 1. **Reset OPS_CONTROLLER_TOKEN:** Generate new token, update `.env`, restart dashboard + ops-controller
 2. **Restore data:** Restore `data/` from a local backup
 3. **Disable MCP tools:** Clear `data/mcp/servers.txt` or set to a single safe server
-4. **Safe mode:** Stop `mcp-gateway` and `hermes-gateway`; use `ollama` + `open-webui` only
+4. **Safe mode:** Stop `mcp-gateway` and `hermes-gateway`; use `llamacpp` + `open-webui` only
diff --git a/compose b/compose
@@ -4,13 +4,11 @@
 #
 # Examples:
 #   ./compose up -d                                             # start all services
-#   ./compose up -d ollama dashboard open-webui                # start core only
+#   ./compose up -d llamacpp dashboard open-webui              # start core only
 #   ./compose down                                             # stop all
-#   ./compose logs -f ollama                                   # tail logs
-#   ./compose run --rm model-puller                            # pull Ollama models
+#   ./compose logs -f llamacpp                                 # tail logs
 #
 # Compose overrides (in overrides/):
-#   ./compose -f docker-compose.yml -f overrides/ollama-expose.yml up -d
 #   ./compose -f docker-compose.yml -f overrides/vllm.yml --profile vllm up -d
 set -e
 

diff --git a/compose.ps1 b/compose.ps1
@@ -3,13 +3,11 @@
 #
 # Examples:
 #   .\compose.ps1 up -d                                             # start all services
-#   .\compose.ps1 up -d ollama dashboard open-webui                # start core only
+#   .\compose.ps1 up -d llamacpp dashboard open-webui              # start core only
 #   .\compose.ps1 down                                             # stop all
-#   .\compose.ps1 logs -f ollama                                   # tail logs
-#   .\compose.ps1 run --rm model-puller                            # pull Ollama models
+#   .\compose.ps1 logs -f llamacpp                                 # tail logs
 #
 # Compose overrides (in overrides/):
-#   .\compose.ps1 -f docker-compose.yml -f overrides/ollama-expose.yml up -d
 #   .\compose.ps1 -f docker-compose.yml -f overrides/vllm.yml --profile vllm up -d
 
 param([Parameter(ValueFromRemainingArguments)][string[]]$PassThrough)

diff --git a/docs/GETTING_STARTED.md b/docs/GETTING_STARTED.md
@@ -6,11 +6,11 @@ Quick paths to common workflows for a single homelab operator. The stack assumes
 
 ### I want to chat
 
-1. Start: `docker compose up -d caddy oauth2-proxy ollama dashboard open-webui`
+1. Start: `docker compose up -d caddy oauth2-proxy llamacpp dashboard open-webui`
 2. Pull a model via the dashboard (`https://${CADDY_TAILNET_HOSTNAME}/dash/` → Starter pack, or pick one)
 3. Open `https://${CADDY_TAILNET_HOSTNAME}/` — Open WebUI
 
-No GPU required for chat (Ollama runs on CPU, slower but works).
+No GPU required for chat (llama.cpp runs on CPU, slower but works).
 
 ### I want to generate images (LTX-2)
 
@@ -20,7 +20,7 @@ No GPU required for chat (Ollama runs on CPU, slower but works).
 
 ### I want workflow automation
 
-1. Start: `docker compose up -d caddy oauth2-proxy ollama n8n`
+1. Start: `docker compose up -d caddy oauth2-proxy llamacpp n8n`
 2. Open `https://${CADDY_TAILNET_HOSTNAME}/n8n/` — n8n
 
 ### Full stack
@@ -35,7 +35,7 @@ Alternatively: `docker compose up -d` — same services without the full bootstr
 
 Use local files as context in **Open WebUI** via Qdrant + the `rag-ingestion` service.
 
-1. **Pull the embedding model** (once): use the dashboard or `docker compose run --rm model-puller` so **`nomic-embed-text`** (or your `EMBED_MODEL`) is available in Ollama.
+1. **Provide the embedding model** (once): place the embedding GGUF (**`nomic-embed-text`**, or your `EMBED_MODEL`) under `models/gguf/` so the `llamacpp-embed` service can serve it.
 2. **Start the RAG profile** (adds Qdrant + `rag-ingestion`):
    ```bash
    docker compose --profile rag up -d
@@ -48,15 +48,12 @@ Env knobs (optional, in `.env`): `EMBED_MODEL`, `RAG_COLLECTION`, `RAG_CHUNK_SIZ
 
 **Optional — [Agentic Design Patterns](https://github.com/Mathews-Tom/Agentic-Design-Patterns) (MIT book text):** clone or copy the `.md` tree into `data/rag-input/` (for example `git clone --depth 1 https://github.com/Mathews-Tom/Agentic-Design-Patterns.git data/rag-input/agentic-design-patterns`), then run the steps above so `rag-ingestion` can index it.
 
-### Direct Ollama (Cursor, CLI on the host machine)
+### Host tools (Cursor, CLI on the host machine)
 
-By default Ollama is backend-only (no host port — host MCP clients should go through `127.0.0.1:11435` model-gateway instead). To expose Ollama directly on the host for tools that speak Ollama's native API:
+The llama.cpp backend is internal (no host port). Host tools reach the models through the model-gateway's OpenAI-compatible API on `127.0.0.1:11435`:
 
-- Start with the Ollama-expose override:
-  `docker compose -f docker-compose.yml -f overrides/ollama-expose.yml up -d`
-- Use `http://localhost:11434` in Cursor or run `ollama run <model>` locally.
-
-Note: this exposes Ollama on `127.0.0.1` to the host machine only — not to the tailnet. Tailnet peers reach models through the SSO-gated front door (Open WebUI at `/`, or via the dashboard's model surface).
+- Point Cursor or any OpenAI-compatible client at `http://localhost:11435/v1`.
+- This is bound to `127.0.0.1` on the host machine only — not to the tailnet. Tailnet peers reach models through the SSO-gated front door (Open WebUI at `/`, or via the dashboard's model surface).
 
 ### Optional: vLLM (OpenAI-compatible server)
 

diff --git a/docs/configuration.md b/docs/configuration.md
@@ -18,7 +18,7 @@ Copy `.env.example` to `.env` and set at least `BASE_PATH`. Everything else has
 |---|---|---|
 | `DATA_PATH` | `${BASE_PATH}/data` | Override data directory location |
 | `DEFAULT_MODEL` | `local-chat` | Canonical model alias used by Open WebUI, Hermes, and LiteLLM |
-| `MODELS` | *(see `.env.example`)* | Comma-separated Ollama models to pull on first start |
+| `GGUF_MODELS` | *(see `.env.example`)* | Hugging Face repo(s) of GGUF files to pull for llama.cpp (`docker compose --profile models run --rm gguf-puller`) |
 | `OPS_CONTROLLER_TOKEN` | *(empty)* | Required for dashboard-driven service lifecycle (`openssl rand -hex 32`) |
 | `DASHBOARD_AUTH_TOKEN` | *(empty)* | Optional Bearer auth on dashboard `/api/*` |
 | `HF_TOKEN` | *(empty)* | Hugging Face token for gated model downloads |
@@ -217,7 +217,6 @@ All `data/` and `models/` directories are bind-mounted and persist across contai
 | `data/mcp/` | `servers.txt`, `registry.json`, `registry-custom.yaml` |
 | `data/dashboard/` | Dashboard throughput / benchmark data |
 | `data/comfyui-storage/` | ComfyUI outputs, custom nodes, local configs |
-| `models/ollama/` | Ollama model blobs |
 | `models/gguf/` | llama.cpp GGUF files |
 | `models/comfyui/` | ComfyUI checkpoints, LoRAs, VAEs, encoders |
 
@@ -234,7 +233,6 @@ All `data/` and `models/` directories are bind-mounted and persist across contai
 | n8n | `5678` | Workflow automation |
 | Hermes dashboard | `9119` | Overridable via `HERMES_DASHBOARD_PORT` |
 | MCP Gateway | `8811` | Published on host so external clients (Cursor, Claude Desktop) can reach it |
-| Ollama | `11434` | **Backend-only by default.** Expose via `overrides/ollama-expose.yml` |
 | Qdrant | `6333` | RAG profile only |
 | Ops Controller | internal `9000` | Not published on the host |
 
@@ -244,7 +242,7 @@ All `data/` and `models/` directories are bind-mounted and persist across contai
 
 ```json
 {"timestamp":"2026-03-22T10:00:00Z","action":"model_pulled","model":"qwen3:8b","status":"success"}
-{"timestamp":"2026-03-22T10:01:00Z","action":"service_started","service":"ollama","status":"success"}
+{"timestamp":"2026-03-22T10:01:00Z","action":"service_started","service":"llamacpp","status":"success"}
 ```
 
 ## Minimal `.env`

diff --git a/docs/data.md b/docs/data.md
@@ -13,7 +13,6 @@ Reference for where data lives, how it moves, and what survives a restart / rebu
 | `data/mcp/registry.json` | MCP server metadata, `allow_clients`, rate limits | `mcp-gateway`, dashboard |
 | `data/mcp/registry-custom.yaml` | Custom catalog fragment (e.g. ComfyUI MCP) | `mcp-gateway` |
 | `data/rag-input/` | Drop zone for RAG documents | `rag-ingestion` watch directory |
-| `models/ollama/` | Ollama model blobs | `ollama` bind mount |
 | `models/gguf/` | llama.cpp GGUF files | `llamacpp` / `llamacpp-embed` bind mount |
 | `models/comfyui/` | ComfyUI checkpoints, LoRAs, VAEs, encoders | `comfyui` bind mount |
 
@@ -36,7 +35,7 @@ Reference for where data lives, how it moves, and what survives a restart / rebu
 
 ```json
 {"timestamp":"2026-03-22T10:00:00Z","action":"model_pulled","model":"qwen3:8b","status":"success"}
-{"timestamp":"2026-03-22T10:01:00Z","action":"service_started","service":"ollama","status":"success"}
+{"timestamp":"2026-03-22T10:01:00Z","action":"service_started","service":"llamacpp","status":"success"}
 ```
 
 | Field | Type | Description |
@@ -108,9 +107,7 @@ All directories created this way persist across restarts and rebuilds.
 
 ### Model Pull
 
-**Ollama:** `docker compose run --rm model-puller` reads `MODELS` from `.env` and pulls each into `models/ollama/`. Also exposed from the dashboard.
-
-**llama.cpp GGUF:** `docker compose --profile models run --rm gguf-puller` with `GGUF_MODELS=org/repo` fetches GGUF files into `models/gguf/`.
+**llama.cpp GGUF:** `docker compose --profile models run --rm gguf-puller` with `GGUF_MODELS=org/repo` fetches GGUF files into `models/gguf/`. Also exposed from the dashboard.
 
 **ComfyUI:** `docker compose run --rm comfyui-model-puller` downloads the pack defined by `COMFYUI_PACKS` (default includes LTX-2 variants) into `models/comfyui/`. First run can be tens of GB.
 
@@ -145,7 +142,6 @@ Hermes maintains its own state under `data/hermes/` — session records, Discord
 | `data/dashboard/` | Throughput / benchmarks | yes | yes |
 | `data/comfyui-storage/` | ComfyUI outputs + custom nodes | yes | yes |
 | `data/n8n-data/` | n8n workflows | yes | yes |
-| `models/ollama/` | Ollama blobs | yes | yes |
 | `models/gguf/` | llama.cpp GGUF files | yes | yes |
 | `models/comfyui/` | ComfyUI weights | yes | yes |
 
@@ -161,7 +157,7 @@ Hermes maintains its own state under `data/hermes/` — session records, Discord
 ### What to back up
 
 1. `data/hermes/` — agent state
-2. `models/ollama/`, `models/gguf/`, `models/comfyui/` — expensive to re-download
+2. `models/gguf/`, `models/comfyui/` — expensive to re-download
 3. `data/ops-controller/audit.log*` — audit history
 4. `data/qdrant/` — RAG collection
 5. `.env` — environment configuration (**do not commit**)
@@ -210,13 +206,13 @@ docker compose up -d
 | `data/ops-controller/audit.log` | Archive rotated files (`audit.log.1` etc.) | Monthly |
 | `data/rag-input/` | Remove processed files | As needed |
 | `data/comfyui-storage/output/` | Prune old outputs | As needed |
-| `models/ollama/` | Remove unused models | Quarterly |
+| `models/gguf/` | Remove unused models | Quarterly |
 
 ```bash
 # Archive current audit log
 mv data/ops-controller/audit.log data/ops-controller/audit.log.$(date +%Y%m%d)
 
-# Prune Ollama
-docker compose exec ollama ollama list
-docker compose exec ollama ollama rm <model-name>
+# Prune GGUF models (delete unused GGUF files)
+ls models/gguf/
+rm models/gguf/<model-file>.gguf
 ```
diff --git a/docs/product requirements docs/appendix-env-vars.md b/docs/product requirements docs/appendix-env-vars.md
@@ -4,9 +4,8 @@
 |----------|---------|-------------|---------|
 | `BASE_PATH` | compose | Project root path | `.` |
 | `DATA_PATH` | compose | Data directory | `${BASE_PATH}/data` |
-| `OLLAMA_URL` | model-gateway, dashboard | Ollama internal URL | `http://ollama:11434` |
+| `LLAMACPP_URL` | model-gateway, dashboard | llama.cpp internal URL | `http://llamacpp:8080` |
 | `VLLM_URL` | model-gateway | vLLM internal URL (optional) | *(empty)* |
-| `DEFAULT_PROVIDER` | model-gateway | Provider for unprefixed models | `ollama` |
 | `MODEL_CACHE_TTL_SEC` | model-gateway | Model list cache TTL seconds | `60` |
 | `DASHBOARD_URL` | model-gateway | Dashboard for throughput recording | `http://dashboard:8080` |
 | `OPS_CONTROLLER_URL` | dashboard | Ops controller URL | `http://ops-controller:9000` |
@@ -20,7 +19,7 @@
 | `MODEL_GATEWAY_PORT` | model-gateway | Model gateway host port | `11435` |
 | `WEBUI_AUTH` | open-webui | Enable Open WebUI auth | `False` (target `True` in M6) |
 | `OPENAI_API_BASE` | open-webui, n8n | OpenAI-compat base URL | `http://model-gateway:11435/v1` |
-| `MODELS` | model-puller | Models to pull on startup | `deepseek-r1:7b,...` |
+| `GGUF_MODELS` | gguf-puller | Hugging Face repo(s) of GGUF files to pull | *(empty)* |
 | `COMPUTE_MODE` | compose | CPU/nvidia/amd | auto-detected |
 | `QDRANT_PORT` | qdrant | Qdrant host port | `6333` |
 | `EMBED_MODEL` | rag-ingestion | Embedding model for RAG | `nomic-embed-text` |

diff --git a/docs/product requirements docs/appendix-quality-bar.md b/docs/product requirements docs/appendix-quality-bar.md
@@ -19,7 +19,7 @@
 ## Performance Targets
 
 - Model list (cached): `<100ms` after first call
-- Model list (cold): `<2s` when Ollama healthy
+- Model list (cold): `<2s` when llama.cpp healthy
 - RAG embedding: `<5s` per document chunk (depends on model)
 - Tool invocation: `<30s` default timeout
 - Ops restart: `<60s` for most services
@@ -42,4 +42,4 @@
 3. Disable all tools: `echo "" > data/mcp/servers.txt`
 4. Invalidate model cache: `curl -X DELETE http://localhost:11435/v1/cache`
 5. Disable unsafe services: `docker compose stop mcp-gateway hermes-gateway comfyui rag-ingestion`
-6. Safe mode: `docker compose up -d ollama model-gateway dashboard open-webui qdrant`
+6. Safe mode: `docker compose up -d llamacpp model-gateway dashboard open-webui qdrant`
diff --git a/docs/product requirements docs/appendix-rollback.md b/docs/product requirements docs/appendix-rollback.md
@@ -1,11 +1,11 @@
 # Appendix: Rollback Procedures
 
-1. **Model gateway:** Point services directly to Ollama (`OLLAMA_BASE_URL=http://ollama:11434`); `docker compose stop model-gateway`. Restart affected services.
+1. **Model gateway:** Point services directly to llama.cpp (`OPENAI_API_BASE=http://llamacpp:8080/v1`); `docker compose stop model-gateway`. Restart affected services.
 2. **Ops controller:** Remove controller from compose or set no token; ops buttons show "unavailable" in dashboard. No data loss.
 3. **MCP registry:** Delete `registry.json`; dashboard falls back to `servers.txt` only. Policy metadata disabled.
 4. **cap_drop / read_only:** Remove from compose; `docker compose up -d --force-recreate <service>`.
 5. **Reset OPS_CONTROLLER_TOKEN:** `openssl rand -hex 32` → update `.env` → `docker compose up -d dashboard ops-controller`.
 6. **MCP tools:** Clear `data/mcp/servers.txt` or set to single safe server → gateway hot-reloads within 10s.
 7. **RAG:** `docker compose stop rag-ingestion qdrant`; remove `VECTOR_DB=qdrant` from Open WebUI env → Open WebUI uses built-in vector store. Qdrant data preserved in `data/qdrant/`.
-8. **Invalidate model cache:** `curl -X DELETE http://localhost:11435/v1/cache` — forces fresh fetch from Ollama on next `/v1/models` call.
-9. **Safe mode:** `docker compose stop mcp-gateway hermes-gateway comfyui rag-ingestion` → Ollama + Open WebUI + dashboard only.
+8. **Invalidate model cache:** `curl -X DELETE http://localhost:11435/v1/cache` — forces fresh fetch from llama.cpp on next `/v1/models` call.
+9. **Safe mode:** `docker compose stop mcp-gateway hermes-gateway comfyui rag-ingestion` → llama.cpp + Open WebUI + dashboard only.