diff --git a/compose b/compose
index d9d242b..42353f8 100755
--- a/compose
+++ b/compose
@@ -7,9 +7,6 @@
 # ./compose up -d llamacpp dashboard open-webui # start core only
 # ./compose down # stop all
 # ./compose logs -f llamacpp # tail logs
-#
-# Compose overrides (in overrides/):
-# ./compose -f docker-compose.yml -f overrides/vllm.yml --profile vllm up -d
 
 set -e
 if [[ $# -eq 0 || "$1" == "--help" || "$1" == "-h" ]]; then
diff --git a/compose.ps1 b/compose.ps1
index a306e73..f6e409c 100644
--- a/compose.ps1
+++ b/compose.ps1
@@ -6,9 +6,6 @@
 # .\compose.ps1 up -d llamacpp dashboard open-webui # start core only
 # .\compose.ps1 down # stop all
 # .\compose.ps1 logs -f llamacpp # tail logs
-#
-# Compose overrides (in overrides/):
-# .\compose.ps1 -f docker-compose.yml -f overrides/vllm.yml --profile vllm up -d
 
 param([Parameter(ValueFromRemainingArguments)][string[]]$PassThrough)
 
diff --git a/docs/GETTING_STARTED.md b/docs/GETTING_STARTED.md
index 4fa052d..c74d293 100644
--- a/docs/GETTING_STARTED.md
+++ b/docs/GETTING_STARTED.md
@@ -55,18 +55,6 @@ The llama.cpp backend is internal (no host port). Host tools reach the models th
 - Point Cursor or any OpenAI-compatible client at `http://localhost:11435/v1`.
 - This is bound to `127.0.0.1` on the host machine only — not to the tailnet. Tailnet peers reach models through the SSO-gated front door (Open WebUI at `/`, or via the dashboard's model surface).
 
-### Optional: vLLM (OpenAI-compatible server)
-
-Use vLLM as an additional model provider (e.g. for Llama, Mistral via Hugging Face):
-
-1. Start with the vLLM profile:
-   `docker compose -f docker-compose.yml -f overrides/vllm.yml --profile vllm up -d`
-2. Set in `.env`: `VLLM_URL=http://vllm:8000`
-3. Restart model-gateway: `docker compose restart model-gateway`
-4. In clients (Open WebUI, Hermes), choose models with prefix `vllm/` (e.g. `vllm/meta-llama/Llama-3.2-3B-Instruct`).
-
-See [overrides/vllm.yml](../overrides/vllm.yml) for `VLLM_MODEL` and resource limits.
-
 ## Tailscale + SSO front door
 
 Single homelab operator with a small Google-account allowlist for friends / family / co-workers — that's the deployment model. UI services don't publish host ports; everything goes through Caddy on the tailnet.
diff --git a/docs/product requirements docs/appendix-env-vars.md b/docs/product requirements docs/appendix-env-vars.md
index 977eadc..88d848c 100644
--- a/docs/product requirements docs/appendix-env-vars.md
+++ b/docs/product requirements docs/appendix-env-vars.md
@@ -5,7 +5,6 @@
 | `BASE_PATH` | compose | Project root path | `.` |
 | `DATA_PATH` | compose | Data directory | `${BASE_PATH}/data` |
 | `LLAMACPP_URL` | model-gateway, dashboard | llama.cpp internal URL | `http://llamacpp:8080` |
-| `VLLM_URL` | model-gateway | vLLM internal URL (optional) | *(empty)* |
 | `MODEL_CACHE_TTL_SEC` | model-gateway | Model list cache TTL seconds | `60` |
 | `DASHBOARD_URL` | model-gateway | Dashboard for throughput recording | `http://dashboard:8080` |
 | `OPS_CONTROLLER_URL` | dashboard | Ops controller URL | `http://ops-controller:9000` |
diff --git a/docs/product requirements docs/architecture-and-principles.md b/docs/product requirements docs/architecture-and-principles.md
index 0b40057..1a4221b 100644
--- a/docs/product requirements docs/architecture-and-principles.md
+++ b/docs/product requirements docs/architecture-and-principles.md
@@ -57,10 +57,9 @@
 │ │ │ servers.txt │ │ → ops ctrl API │ │ data/rag- │ │ │
 │ │ │ registry.json │ │ registry.json │ │ input/ │ │ │
 │ │ └─────────────────┘ └─────────────────┘ └──────────────┘ │ │
-│ │ ┌─────────────────┐ ┌─────────────────┐ │ │
-│ │ │ vLLM (opt) │ │ ComfyUI :8188 │ │ │
-│ │ │ overrides/ │ │ (frontend net) │ │ │
-│ │ │ vllm.yml │ └─────────────────┘ │ │
+│ │ ┌─────────────────┐ │ │
+│ │ │ ComfyUI :8188 │ │ │
+│ │ │ (frontend net) │ │ │
 │ │ └─────────────────┘ │ │
 │ └──────────────────────────────────────────────────────────────────────────┘ │
 └────────────────────────────────────────────────────────────────────────────────┘
diff --git a/docs/product requirements docs/component-model-gateway.md b/docs/product requirements docs/component-model-gateway.md
index d9de5ce..ae5c6ce 100644
--- a/docs/product requirements docs/component-model-gateway.md
+++ b/docs/product requirements docs/component-model-gateway.md
@@ -69,29 +69,6 @@ model-gateway:
     - DASHBOARD_URL=http://dashboard:8080
 ```
 
-### vLLM Compose Profile (Optional)
-
-```yaml
-# overrides/vllm.yml
-services:
-  vllm:
-    profiles: [vllm]
-    image: vllm/vllm-openai:latest
-    ports:
-      - "8000:8000"
-    environment:
-      - MODEL=${VLLM_MODEL:-meta-llama/Llama-3.2-3B-Instruct}
-    deploy:
-      resources:
-        limits:
-          memory: 16G
-        reservations:
-          devices:
-            - driver: nvidia
-              count: 1
-              capabilities: [gpu]
-```
-
 ## Non-Goals
 - Direct UI rendering. UI components are separate and consume the gateway.
 - Persistent storage of model results — the gateway only forwards results.
diff --git a/docs/product requirements docs/index.md b/docs/product requirements docs/index.md
index a4ffeb0..610eba4 100644
--- a/docs/product requirements docs/index.md
+++ b/docs/product requirements docs/index.md
@@ -41,7 +41,6 @@ A self-hosted AI platform that any developer can run with `./compose up -d`. Cor
 | llama.cpp backend-only (no host port) | Live | `docker-compose.yml` |
 | SSRF egress block scripts | Live | `scripts/ssrf-egress-block.sh`, `.ps1` |
 | Hermes agent (gateway + dashboard) | Live | `docker-compose.yml`, `hermes/` |
-| vLLM optional compose profile | Live | `overrides/vllm.yml` |
 | Contract + smoke tests | Live | `tests/` |
 
 ## Open Risks
diff --git a/docs/product requirements docs/milestones-and-roadmap.md b/docs/product requirements docs/milestones-and-roadmap.md
index e376208..4c7947a 100644
--- a/docs/product requirements docs/milestones-and-roadmap.md
+++ b/docs/product requirements docs/milestones-and-roadmap.md
@@ -8,7 +8,7 @@
 | **M1** | Done | Model Gateway: OpenAI-compat, llama.cpp, streaming, embeddings, throughput |
 | **M2** | Done | Ops Controller: start/stop/restart/logs/pull/audit; dashboard calls controller; bearer auth |
 | **M3** | Done | MCP registry.json + health API; cap_drop/read_only hardening; model list cache; Open WebUI → gateway default |
-| **M4** | Done | Explicit Docker networks (frontend/backend); correlation IDs (X-Request-ID → audit); vLLM compose profile; smoke tests |
+| **M4** | Done | Explicit Docker networks (frontend/backend); correlation IDs (X-Request-ID → audit); smoke tests |
 | **M5** | Done | Dashboard MCP health dots (green/yellow/red); SSRF egress scripts; hardware stats; throughput benchmark; default-model management |
 | **M5-ext** | Done | RAG pipeline (Qdrant + rag-ingestion); Open WebUI → Qdrant; RAG status endpoint; Responses API + completions compat; cache-bust endpoint |
 | **M6** | Partial | **Done:** mcp-gateway backend-only; CI; audit log rotation. **Deferred:** MCP per-client / `X-Client-ID` (upstream). **Skipped:** `WEBUI_AUTH` default → True |
@@ -30,12 +30,11 @@
 
 ---
 
-## M4 — Networks + Correlation + vLLM + Smoke Tests (Done)
+## M4 — Networks + Correlation + Smoke Tests (Done)
 
 **User-visible outcomes:**
 - Explicit `ordo-ai-stack-frontend` / `ordo-ai-stack-backend` networks; llama.cpp/ops-controller on backend only
 - Request IDs: `X-Request-ID` forwarded dashboard → ops-controller and stored in audit entries
-- vLLM: `overrides/vllm.yml` with profile `vllm`
 - Smoke tests: `tests/test_compose_smoke.py`
 
 ---
diff --git a/docs/product requirements docs/risks-and-questions.md b/docs/product requirements docs/risks-and-questions.md
index aa3eec3..985b9f3 100644
--- a/docs/product requirements docs/risks-and-questions.md
+++ b/docs/product requirements docs/risks-and-questions.md
@@ -23,7 +23,6 @@
 | 3 | **MCP gateway policy:** Does Docker MCP Gateway support `X-Client-ID` for per-client allowlist? | Open — not yet; deferred to M6 |
 | 5 | **llama.cpp host port:** Remove to reduce attack surface? | Resolved — backend-only; no host port |
 | 6 | **Audit log rotation** | Resolved — size-based rotation (`AUDIT_LOG_MAX_BYTES`) |
-| 7 | **vLLM timing** | Resolved — `overrides/vllm.yml` with `--profile vllm` |
 | 8 | **ComfyUI non-root** | Open — `yanwk/comfyui-boot` runs as root; image limitation |
 | 9 | **Smoke test in CI** | Resolved — see `.github/workflows/ci.yml` |
 | 10 | **N8N LLM node** | Open — use OpenAI-compat node with `baseURL: http://model-gateway:11435/v1`; needs example workflow doc |
diff --git a/tests/test_compose_smoke.py b/tests/test_compose_smoke.py
index b4436db..a89afcf 100644
--- a/tests/test_compose_smoke.py
+++ b/tests/test_compose_smoke.py
@@ -1,6 +1,6 @@
 """Compose config and optional runtime smoke tests.
 
-- Config tests: validate docker-compose.yml (and optional vllm override) parse and merge.
+- Config tests: validate docker-compose.yml parses and merges.
 - Runtime smoke: set RUN_COMPOSE_SMOKE=1 to run 'compose up -d' and assert key services healthy (requires Docker daemon; use in CI or locally).
 """
 
@@ -15,7 +15,6 @@
 
 REPO_ROOT = Path(__file__).resolve().parent.parent
 COMPOSE_FILE = REPO_ROOT / "docker-compose.yml"
-COMPOSE_VLLM = REPO_ROOT / "overrides" / "vllm.yml"
 
 # Services that must be healthy for "smoke" (long-running core stack)
 SMOKE_SERVICES = ["llamacpp", "llamacpp-embed", "model-gateway", "dashboard"]
@@ -33,8 +32,6 @@
 
 def _compose_cmd(*args, extra_env=None, timeout=120):
     cmd = ["docker", "compose", "-f", str(COMPOSE_FILE)]
-    if COMPOSE_VLLM.exists():
-        cmd += ["-f", str(COMPOSE_VLLM)]
     cmd += list(args)
     env = {**os.environ, **_COMPOSE_REQUIRED_PLACEHOLDERS, **(extra_env or {})}
     return subprocess.run(
@@ -62,13 +59,6 @@ def test_compose_config_includes_networks():
     assert "ordo-ai-stack-backend" in out or "backend" in out
 
 
-@pytest.mark.skipif(not COMPOSE_VLLM.exists(), reason="overrides/vllm.yml not present")
-def test_compose_vllm_override_config_valid():
-    """With vllm override, compose config still valid (vllm profile)."""
-    r = _compose_cmd("config", "--quiet", extra_env={"COMPOSE_PROFILES": "vllm"})
-    assert r.returncode == 0, f"vllm config failed: {r.stderr or r.stdout}"
-
-
 def _has_nvidia_gpu() -> bool:
     """Return True iff `nvidia-smi` is available and exits 0."""
     try: