AlienWalker1995 · Jul 1, 2026
diff --git a/compose b/compose
@@ -7,9 +7,6 @@
 #   ./compose up -d llamacpp dashboard open-webui              # start core only
 #   ./compose down                                             # stop all
 #   ./compose logs -f llamacpp                                 # tail logs
-#
-# Compose overrides (in overrides/):
-#   ./compose -f docker-compose.yml -f overrides/vllm.yml --profile vllm up -d
 set -e
 
 if [[ $# -eq 0 || "$1" == "--help" || "$1" == "-h" ]]; then

diff --git a/compose.ps1 b/compose.ps1
@@ -6,9 +6,6 @@
 #   .\compose.ps1 up -d llamacpp dashboard open-webui              # start core only
 #   .\compose.ps1 down                                             # stop all
 #   .\compose.ps1 logs -f llamacpp                                 # tail logs
-#
-# Compose overrides (in overrides/):
-#   .\compose.ps1 -f docker-compose.yml -f overrides/vllm.yml --profile vllm up -d
 
 param([Parameter(ValueFromRemainingArguments)][string[]]$PassThrough)
 

diff --git a/docs/GETTING_STARTED.md b/docs/GETTING_STARTED.md
@@ -55,18 +55,6 @@ The llama.cpp backend is internal (no host port). Host tools reach the models th
 - Point Cursor or any OpenAI-compatible client at `http://localhost:11435/v1`.
 - This is bound to `127.0.0.1` on the host machine only — not to the tailnet. Tailnet peers reach models through the SSO-gated front door (Open WebUI at `/`, or via the dashboard's model surface).
 
-### Optional: vLLM (OpenAI-compatible server)
-
-Use vLLM as an additional model provider (e.g. for Llama, Mistral via Hugging Face):
-
-1. Start with the vLLM profile:
-   `docker compose -f docker-compose.yml -f overrides/vllm.yml --profile vllm up -d`
-2. Set in `.env`: `VLLM_URL=http://vllm:8000`
-3. Restart model-gateway: `docker compose restart model-gateway`
-4. In clients (Open WebUI, Hermes), choose models with prefix `vllm/<model-id>` (e.g. `vllm/meta-llama/Llama-3.2-3B-Instruct`).
-
-See [overrides/vllm.yml](../overrides/vllm.yml) for `VLLM_MODEL` and resource limits.
-
 ## Tailscale + SSO front door
 
 Single homelab operator with a small Google-account allowlist for friends / family / co-workers — that's the deployment model. UI services don't publish host ports; everything goes through Caddy on the tailnet.

diff --git a/docs/product requirements docs/appendix-env-vars.md b/docs/product requirements docs/appendix-env-vars.md
@@ -5,7 +5,6 @@
 | `BASE_PATH` | compose | Project root path | `.` |
 | `DATA_PATH` | compose | Data directory | `${BASE_PATH}/data` |
 | `LLAMACPP_URL` | model-gateway, dashboard | llama.cpp internal URL | `http://llamacpp:8080` |
-| `VLLM_URL` | model-gateway | vLLM internal URL (optional) | *(empty)* |
 | `MODEL_CACHE_TTL_SEC` | model-gateway | Model list cache TTL seconds | `60` |
 | `DASHBOARD_URL` | model-gateway | Dashboard for throughput recording | `http://dashboard:8080` |
 | `OPS_CONTROLLER_URL` | dashboard | Ops controller URL | `http://ops-controller:9000` |

diff --git a/docs/product requirements docs/architecture-and-principles.md b/docs/product requirements docs/architecture-and-principles.md
@@ -57,10 +57,9 @@
 │  │  │ servers.txt     │  │ → ops ctrl API   │  │ data/rag-    │             │  │
 │  │  │ registry.json   │  │ registry.json    │  │ input/       │             │  │
 │  │  └─────────────────┘  └─────────────────┘  └──────────────┘             │  │
-│  │  ┌─────────────────┐  ┌─────────────────┐                               │  │
-│  │  │ vLLM (opt)      │  │ ComfyUI :8188   │                               │  │
-│  │  │ overrides/      │  │ (frontend net)  │                               │  │
-│  │  │ vllm.yml        │  └─────────────────┘                               │  │
+│  │  ┌─────────────────┐                                                     │  │
+│  │  │ ComfyUI :8188   │                                                     │  │
+│  │  │ (frontend net)  │                                                     │  │
 │  │  └─────────────────┘                                                     │  │
 │  └──────────────────────────────────────────────────────────────────────────┘  │
 └────────────────────────────────────────────────────────────────────────────────┘

diff --git a/docs/product requirements docs/component-model-gateway.md b/docs/product requirements docs/component-model-gateway.md
@@ -69,29 +69,6 @@ model-gateway:
     - DASHBOARD_URL=http://dashboard:8080
 ```
 
-### vLLM Compose Profile (Optional)
-
-```yaml
-# overrides/vllm.yml
-services:
-  vllm:
-    profiles: [vllm]
-    image: vllm/vllm-openai:latest
-    ports:
-      - "8000:8000"
-    environment:
-      - MODEL=${VLLM_MODEL:-meta-llama/Llama-3.2-3B-Instruct}
-    deploy:
-      resources:
-        limits:
-          memory: 16G
-        reservations:
-          devices:
-            - driver: nvidia
-              count: 1
-              capabilities: [gpu]
-```
-
 ## Non-Goals
 - Direct UI rendering. UI components are separate and consume the gateway.
 - Persistent storage of model results — the gateway only forwards results.

diff --git a/docs/product requirements docs/index.md b/docs/product requirements docs/index.md
@@ -41,7 +41,6 @@ A self-hosted AI platform that any developer can run with `./compose up -d`. Cor
 | llama.cpp backend-only (no host port) | Live | `docker-compose.yml` |
 | SSRF egress block scripts | Live | `scripts/ssrf-egress-block.sh`, `.ps1` |
 | Hermes agent (gateway + dashboard) | Live | `docker-compose.yml`, `hermes/` |
-| vLLM optional compose profile | Live | `overrides/vllm.yml` |
 | Contract + smoke tests | Live | `tests/` |
 
 ## Open Risks

diff --git a/docs/product requirements docs/milestones-and-roadmap.md b/docs/product requirements docs/milestones-and-roadmap.md
@@ -8,7 +8,7 @@
 | **M1** | Done | Model Gateway: OpenAI-compat, llama.cpp, streaming, embeddings, throughput |
 | **M2** | Done | Ops Controller: start/stop/restart/logs/pull/audit; dashboard calls controller; bearer auth |
 | **M3** | Done | MCP registry.json + health API; cap_drop/read_only hardening; model list cache; Open WebUI → gateway default |
-| **M4** | Done | Explicit Docker networks (frontend/backend); correlation IDs (X-Request-ID → audit); vLLM compose profile; smoke tests |
+| **M4** | Done | Explicit Docker networks (frontend/backend); correlation IDs (X-Request-ID → audit); smoke tests |
 | **M5** | Done | Dashboard MCP health dots (green/yellow/red); SSRF egress scripts; hardware stats; throughput benchmark; default-model management |
 | **M5-ext** | Done | RAG pipeline (Qdrant + rag-ingestion); Open WebUI → Qdrant; RAG status endpoint; Responses API + completions compat; cache-bust endpoint |
 | **M6** | Partial | **Done:** mcp-gateway backend-only; CI; audit log rotation. **Deferred:** MCP per-client / `X-Client-ID` (upstream). **Skipped:** `WEBUI_AUTH` default → True |
@@ -30,12 +30,11 @@
 
 ---
 
-## M4 — Networks + Correlation + vLLM + Smoke Tests (Done)
+## M4 — Networks + Correlation + Smoke Tests (Done)
 
 **User-visible outcomes:**
 - Explicit `ordo-ai-stack-frontend` / `ordo-ai-stack-backend` networks; llama.cpp/ops-controller on backend only
 - Request IDs: `X-Request-ID` forwarded dashboard → ops-controller and stored in audit entries
-- vLLM: `overrides/vllm.yml` with profile `vllm`
 - Smoke tests: `tests/test_compose_smoke.py`
 
 ---

diff --git a/docs/product requirements docs/risks-and-questions.md b/docs/product requirements docs/risks-and-questions.md
@@ -23,7 +23,6 @@
 | 3 | **MCP gateway policy:** Does Docker MCP Gateway support `X-Client-ID` for per-client allowlist? | Open — not yet; deferred to M6 |
 | 5 | **llama.cpp host port:** Remove to reduce attack surface? | Resolved — backend-only; no host port |
 | 6 | **Audit log rotation** | Resolved — size-based rotation (`AUDIT_LOG_MAX_BYTES`) |
-| 7 | **vLLM timing** | Resolved — `overrides/vllm.yml` with `--profile vllm` |
 | 8 | **ComfyUI non-root** | Open — `yanwk/comfyui-boot` runs as root; image limitation |
 | 9 | **Smoke test in CI** | Resolved — see `.github/workflows/ci.yml` |
 | 10 | **N8N LLM node** | Open — use OpenAI-compat node with `baseURL: http://model-gateway:11435/v1`; needs example workflow doc |

diff --git a/tests/test_compose_smoke.py b/tests/test_compose_smoke.py
@@ -1,6 +1,6 @@
 """Compose config and optional runtime smoke tests.
 
-- Config tests: validate docker-compose.yml (and optional vllm override) parse and merge.
+- Config tests: validate docker-compose.yml parses and merges.
 - Runtime smoke: set RUN_COMPOSE_SMOKE=1 to run 'compose up -d' and assert key services healthy
   (requires Docker daemon; use in CI or locally).
 """
@@ -15,7 +15,6 @@
 
 REPO_ROOT = Path(__file__).resolve().parent.parent
 COMPOSE_FILE = REPO_ROOT / "docker-compose.yml"
-COMPOSE_VLLM = REPO_ROOT / "overrides" / "vllm.yml"
 
 # Services that must be healthy for "smoke" (long-running core stack)
 SMOKE_SERVICES = ["llamacpp", "llamacpp-embed", "model-gateway", "dashboard"]
@@ -33,8 +32,6 @@
 
 def _compose_cmd(*args, extra_env=None, timeout=120):
     cmd = ["docker", "compose", "-f", str(COMPOSE_FILE)]
-    if COMPOSE_VLLM.exists():
-        cmd += ["-f", str(COMPOSE_VLLM)]
     cmd += list(args)
     env = {**os.environ, **_COMPOSE_REQUIRED_PLACEHOLDERS, **(extra_env or {})}
     return subprocess.run(
@@ -62,13 +59,6 @@ def test_compose_config_includes_networks():
     assert "ordo-ai-stack-backend" in out or "backend" in out
 
 
-@pytest.mark.skipif(not COMPOSE_VLLM.exists(), reason="overrides/vllm.yml not present")
-def test_compose_vllm_override_config_valid():
-    """With vllm override, compose config still valid (vllm profile)."""
-    r = _compose_cmd("config", "--quiet", extra_env={"COMPOSE_PROFILES": "vllm"})
-    assert r.returncode == 0, f"vllm config failed: {r.stderr or r.stdout}"
-
-
 def _has_nvidia_gpu() -> bool:
     """Return True iff `nvidia-smi` is available and exits 0."""
     try: