A collection of high-performance, production-ready LLM deployment stacks for HomeOps environments. This repository focuses on squeezing enterprise-grade throughput out of consumer-grade hardware using vLLM, FP8 quantization, and meticulously tuned speculative decoding.
This isn't just a folder of compose files; it's the result of pushing over 47 million tokens through a single node to find the right balance between hardware weight and assistant reactivity. Like a Triple-Fried Egg, Chilli, and Chutney sandwich, these configurations shouldn't work this well, but they do. This setup represents the "Holly" of local inference: taking the raw compute of the GB10 Spark and turning it into something that actually feels reactive.
- compose-monitoring.yaml: The observability backbone of the entire ship; contains the Prometheus, Grafana, and DCGM-Exporter stack required to baseline your VRAM and track speculative hit rates in real time without interfering with the model runtimes
- compose-qwen3.6-35b.yaml: The current heavy-lifter in the fleet; tuned specifically for Qwen3.6-35B-FP8 with a dflash speculative draft model and a massive 256k context window for deep repository analysis
- compose-gemma-4-26B-nvfp4.yaml: The MoE (Mixture of Experts) specialist; optimised for the Gemma-4-26B-A4B architecture with Eagle3 speculative decoding enabled
- prometheus.yaml: Prometheus configuration file that scrapes metrics from the vLLM runtime and dcgm-exporter containers
- dashboards/: A curated set of Grafana dashboard JSON exports designed to get your metrics up and running without any manual, "smeg-headed" clicking or configuration drift
- Clone the repo
    git clone https://github.com/Kelcode-Dev/vllm-homeops-hub
- Create the network
    docker network create model-runtimes
- Spin up monitoring
    docker compose -f compose-monitoring.yaml up -d
- Deploy your model
    docker compose -f compose-qwen3.6-35b-fp8.yaml up -d

Adjust your gpu-memory-utilization and max-model-len based on your specific VRAM headroom to ensure the ship doesn't do a "Lister-style" crash during heavy ingestion.
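To make the gpu-memory-utilization and max-model-len trade-off a little less "vibes", here is a rough back-of-envelope sketch of how many tokens of KV cache fit in a given memory budget. Everything in it is an assumption for illustration: the function names are hypothetical, the formula counts only standard full-attention layers (hybrid or linear-attention layers are sized differently), and the weight and overhead figures are numbers you would measure on your own node. vLLM reports its real KV cache capacity in the startup logs, which is the figure to trust.

```python
def kv_bytes_per_token(attn_layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies.

    Each full-attention layer stores one key vector and one value vector
    per KV head, hence the leading factor of 2.
    """
    return 2 * attn_layers * kv_heads * head_dim * dtype_bytes


def max_cached_tokens(
    total_mem_gib: float,
    gpu_memory_utilization: float,
    weights_gib: float,
    overhead_gib: float,
    bytes_per_token: int,
) -> int:
    """Estimate how many tokens fit in whatever memory is left for KV cache.

    budget = total * utilization - weights - overhead (activations, graphs, etc.)
    Returns 0 when the budget is already exhausted.
    """
    budget_gib = total_mem_gib * gpu_memory_utilization - weights_gib - overhead_gib
    if budget_gib <= 0:
        return 0
    return int(budget_gib * 1024**3 // bytes_per_token)


if __name__ == "__main__":
    # Hypothetical model shape and memory figures, not measurements from this repo.
    per_token = kv_bytes_per_token(attn_layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)
    tokens = max_cached_tokens(
        total_mem_gib=100, gpu_memory_utilization=0.5,
        weights_gib=30, overhead_gib=4, bytes_per_token=per_token,
    )
    print(f"{per_token} bytes/token, room for roughly {tokens} cached tokens")
```

If the estimate comes out below your intended max-model-len, either raise gpu-memory-utilization (if the rest of the system can spare it) or lower max-model-len before the engine makes the decision for you.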
The true power of this stack lies in its visibility. To move beyond "vibes" and into precise tuning, the /dashboards folder contains pre-configured Grafana JSON exports that provide a real-time "State of Grace" view of your hardware and model performance.
A comprehensive "Mission Control" view combining NVIDIA DCGM hardware metrics with vLLM engine performance:
- Hardware Vitals: Real-time monitoring of GPU Utilisation, SM Clock speeds, Power Draw (Watts), and die/memory temperatures to ensure your hardware isn't running hotter than a supernova
- Engine Pressure: Tracks the KV Cache usage percentage (critical for preventing OOMs during long-context repository analysis) and active preemption events that signal the engine is struggling
- Speculative Performance: Dedicated panels for Draft Token Acceptance Rate and average tokens accepted per draft; a sustained drop here suggests your draft model has diverged from the target's distribution
- Latency Radar: Visualises Time to First Token (TTFT) and Inter-Token Latency (p50/p99) to ensure your "Rule of 4" tuning is actually delivering the snappy, reactive experience it claims to
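The two Speculative Performance numbers above boil down to simple ratios over counter deltas. The sketch below shows the arithmetic, assuming you have three counts for the same time window: drafts proposed, draft tokens proposed, and draft tokens accepted. The function name is hypothetical, and the mapping to metric names is an assumption: recent vLLM builds expose counters along the lines of vllm:spec_decode_num_drafts, vllm:spec_decode_num_draft_tokens, and vllm:spec_decode_num_accepted_tokens, but names vary by version, so check your own /metrics output.

```python
from typing import Optional


def speculative_stats(
    num_drafts: int, num_draft_tokens: int, num_accepted_tokens: int
) -> Optional[dict]:
    """Summarise speculative decoding efficiency over one time window.

    Inputs are counter *deltas* (end value minus start value), not raw totals.
    Returns None when no drafts were proposed, since the ratios are undefined.
    """
    if num_drafts <= 0 or num_draft_tokens <= 0:
        return None
    accepted_per_draft = num_accepted_tokens / num_drafts
    return {
        # Share of proposed draft tokens the target model agreed with.
        "acceptance_rate": num_accepted_tokens / num_draft_tokens,
        # Average number of draft tokens accepted per speculation step.
        "accepted_per_draft": accepted_per_draft,
        # Each step also yields one token from the target model itself.
        "tokens_per_step": 1 + accepted_per_draft,
    }


if __name__ == "__main__":
    # Illustrative numbers only: 100 drafts of 4 tokens each, 300 accepted.
    print(speculative_stats(num_drafts=100, num_draft_tokens=400, num_accepted_tokens=300))
```

A tokens_per_step value close to 1 means the draft model is adding overhead without buying any speed, which is the "diverged from the target's distribution" case described above.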
The "Financial Controller" for your homelab, calculating the ROI of self-hosting vs. commercial APIs:
- Token Velocity: A cumulative breakdown of prompt (ISL) vs. generation (OSL) tokens over 1, 7, 14, and 30-day windows to track your overall consumption trends
- Claude 3.5 ROI: Dynamically calculates "Cost Savings" by comparing your local volume against commercial pricing, defaulting to current market rates for Sonnet-tier performance
- The Electricity Reality: Uses a configurable £/kWh rate to calculate the actual running cost of your GB10 Spark device, providing a blunt "Self-Host vs. API" value comparison that doesn't sugar-coat the bill
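The "Self-Host vs. API" comparison above is, at heart, two multiplications and a subtraction. Here is a minimal sketch of that arithmetic, not the dashboard's actual queries. The function names are hypothetical, and the prices and wattage in the example are placeholders rather than real rates; it also assumes you have already converted the API prices and the electricity rate into the same currency.

```python
def api_cost(
    prompt_tokens: int,
    generation_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """What the same token volume would cost on a per-million-token API tariff."""
    return (
        prompt_tokens / 1_000_000 * input_price_per_mtok
        + generation_tokens / 1_000_000 * output_price_per_mtok
    )


def electricity_cost(avg_watts: float, hours: float, rate_per_kwh: float) -> float:
    """Running cost of the node: average draw converted to kWh, times the tariff."""
    return avg_watts * hours / 1000 * rate_per_kwh


def net_saving(api: float, electricity: float) -> float:
    """Positive means self-hosting came out cheaper over the window."""
    return api - electricity


if __name__ == "__main__":
    # Placeholder figures for illustration; substitute your own volumes and rates.
    api = api_cost(2_000_000, 1_000_000, input_price_per_mtok=3.0, output_price_per_mtok=15.0)
    power = electricity_cost(avg_watts=200, hours=10, rate_per_kwh=0.25)
    print(f"API equivalent: {api:.2f}, electricity: {power:.2f}, saving: {net_saving(api, power):.2f}")
```

This deliberately ignores hardware purchase cost, so treat the result as a running-cost comparison rather than a full payback calculation.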
- Ensure your compose-monitoring.yaml stack is running and Prometheus is successfully scraping your vLLM and DCGM endpoints
- Open Grafana and navigate to Dashboards > New > Import
- Upload the .json files from the /dashboards folder
- Configure variables:
  - Set the GPU Instance and vLLM Instance variables at the top of the dashboard to match your Docker network IPs
  - Adjust the Input/Output Cost and Electricity Rate in the Token dashboard to match your local currency and current utility pricing
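Before importing anything, it helps to confirm the first step: that Prometheus really is scraping both endpoints. The sketch below reads Prometheus's /api/v1/targets endpoint and lists any active target whose health is not "up". The helper names are hypothetical, and the URL assumes Prometheus is published on localhost:9090, which may not match the port mapping in your compose file.

```python
import json
from urllib.request import urlopen


def unhealthy_targets(targets_response: dict) -> list:
    """Return 'job @ scrapeUrl' for every active target that is not healthy.

    Expects the parsed JSON body of Prometheus's /api/v1/targets endpoint.
    """
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        f"{t.get('labels', {}).get('job', '?')} @ {t.get('scrapeUrl', '?')}"
        for t in active
        if t.get("health") != "up"
    ]


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the targets listing; base_url is an assumed default, adjust as needed."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    down = unhealthy_targets(fetch_targets())
    print("All targets up" if not down else "Not scraping: " + ", ".join(down))
```

An empty result means every configured target is being scraped; anything listed is a panel that will sit stubbornly blank after import.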