vLLM Homeops Hub

A collection of high-performance, production-ready LLM deployment stacks for HomeOps environments. This repository focuses on squeezing enterprise-grade throughput out of consumer-grade hardware using vLLM, FP8 quantization, and meticulously tuned speculative decoding.

The "State of Grace"

This isn't just a folder of compose files; it's the result of pushing over 47 million tokens through a single node to find the right balance between hardware weight and assistant reactivity. Like a Triple-Fried Egg, Chilli, and Chutney sandwich, these configurations shouldn't work this well, but they do. This setup represents the "Holly" of local inference: taking the raw compute of the GB10 Spark and turning it into something that actually feels fundamentally reactive.

Project Structure

compose-monitoring.yaml - The observability backbone of the entire ship; contains the Prometheus, Grafana, and DCGM-Exporter stack required to baseline your VRAM and track speculative hit rates in real time without interfering with the model runtimes.
compose-qwen3.6-35b-fp8.yaml - The current heavy-lifter in the fleet; tuned specifically for Qwen3.6-35B-FP8 with a dflash speculative draft model and a massive 256k context window for deep repository analysis.
compose-gemma-4-26B-nvfp4.yaml - The MoE (Mixture of Experts) specialist; optimised for the Gemma-4-26B-A4B architecture with Eagle3 speculative decoding enabled.
prometheus.yaml - Prometheus configuration file that scrapes metrics from the vLLM runtime and DCGM-Exporter containers.
dashboards/ - A curated set of Grafana dashboard JSON exports designed to get your metrics up and running without any manual, "smeg-headed" clicking or configuration drift.

Getting Started

Clone the Repo

git clone https://github.com/Kelcode-Dev/vllm-homeops-hub

Create the network

docker network create model-runtimes

Spin up Monitoring

docker compose -f compose-monitoring.yaml up -d

Deploy your Model

docker compose -f compose-qwen3.6-35b-fp8.yaml up -d

Adjust your gpu-memory-utilization and max-model-len based on your specific VRAM headroom to ensure the ship doesn't do a "Lister-style" crash during heavy ingestion.
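
A rough way to size max-model-len before editing the compose file is to estimate the KV cache footprint. The sketch below uses the standard per-token formula for a plain attention stack (keys plus values, per layer, per KV head). Every architecture number in it is a hypothetical placeholder, not the real Qwen or Gemma configuration, and models with hybrid or sliding-window layers may need less than this estimate.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size in bytes for a standard attention stack.

    bytes_per_value is 2 for a 16-bit cache and 1 for an 8-bit cache.
    """
    # Factor of 2 covers the key tensor and the value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def max_tokens_for_budget(budget_bytes, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Largest token count whose KV cache fits inside budget_bytes."""
    per_token = kv_cache_bytes(1, num_layers, num_kv_heads, head_dim, bytes_per_value)
    return budget_bytes // per_token


if __name__ == "__main__":
    # Hypothetical architecture: 40 layers, 8 KV heads, head_dim 128, 16-bit cache.
    gib = 1024 ** 3
    needed = kv_cache_bytes(256 * 1024, 40, 8, 128)
    print(f"KV cache for a 256k context: {needed / gib:.1f} GiB")
    print("Tokens that fit in 24 GiB:", max_tokens_for_budget(24 * gib, 40, 8, 128))
```

If the estimate is larger than the memory left over once the weights are loaded, either lower max-model-len or raise gpu-memory-utilization.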

Observability & Dashboards

The true power of this stack lies in its visibility. To move beyond "vibes" and into precise tuning, the /dashboards folder contains pre-configured Grafana JSON exports that provide a real-time "State of Grace" view of your hardware and model performance.

1. GPU Service Health (`gpu-model-health.json`)

A comprehensive "Mission Control" view combining NVIDIA DCGM hardware metrics with vLLM engine performance:

Hardware Vitals: Real-time monitoring of GPU Utilisation, SM Clock speeds, Power Draw (Watts), and die/memory temperatures to ensure your hardware isn't running hotter than a supernova.
Engine Pressure: Tracks the KV Cache usage percentage (critical for preventing OOMs during long-context repository analysis) and active preemption events that signal the engine is struggling.
Speculative Performance: Dedicated panels for Draft Token Acceptance Rate and average tokens accepted per draft; a sustained drop here suggests your draft model has diverged from the target's distribution.
Latency Radar: Visualises Time to First Token (TTFT) and Inter-Token Latency (p50/p99) to ensure your "Rule of 4" tuning is actually delivering the snappy, reactive experience it claims to.
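
The two Speculative Performance figures are simple ratios of counters. A minimal sketch of the arithmetic, assuming you already have the increase of three counters over the same time window (number of drafts, draft tokens proposed, draft tokens accepted); the function and argument names are illustrative and not taken from the dashboard JSON:

```python
def speculative_stats(drafts, draft_tokens, accepted_tokens):
    """Return (acceptance_rate, mean_accepted_per_draft) for one time window.

    drafts          -- number of draft proposals made in the window
    draft_tokens    -- total tokens proposed by the draft model
    accepted_tokens -- proposed tokens the target model accepted
    Returns None when the window has no speculative activity.
    """
    if drafts <= 0 or draft_tokens <= 0:
        return None
    acceptance_rate = accepted_tokens / draft_tokens
    mean_accepted_per_draft = accepted_tokens / drafts
    return acceptance_rate, mean_accepted_per_draft


if __name__ == "__main__":
    # Hypothetical window: 1,000 drafts of 4 tokens each, 2,600 tokens accepted.
    print(speculative_stats(1000, 4000, 2600))
```

Computing both over a window rather than over lifetime totals is what makes a sustained drop visible.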

2. LLM Token Usage & Cost Savings (`llm-token-usage-and-cost-savings.json`)

The "Financial Controller" for your homelab, calculating the ROI of self-hosting vs. commercial APIs:

Token Velocity: A cumulative breakdown of prompt (ISL) vs. generation (OSL) tokens over 1, 7, 14, and 30-day windows to track your overall consumption trends.
Claude 3.5 ROI: Dynamically calculates "Cost Savings" by comparing your local volume against commercial pricing, defaulting to current market rates for Sonnet-tier performance.
The Electricity Reality: Uses a configurable £/kWh rate to calculate the actual running cost of your GB10 Spark device, providing a blunt "Self-Host vs. API" value comparison that doesn't sugar-coat the bill.
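
The comparison behind these panels can be sketched in a few lines. This is an illustration of the arithmetic only, not the queries the dashboard uses; the token counts, per-million-token prices, power draw, and tariff below are hypothetical placeholders, and both sides must be expressed in the same currency.

```python
def api_cost(prompt_tokens, generation_tokens, input_price_per_mtok, output_price_per_mtok):
    """What the same token volume would cost at per-million-token API prices."""
    return (prompt_tokens * input_price_per_mtok
            + generation_tokens * output_price_per_mtok) / 1_000_000


def electricity_cost(avg_watts, hours, price_per_kwh):
    """Running cost of the node over the window at a flat tariff."""
    return avg_watts * hours / 1000 * price_per_kwh


def self_host_savings(prompt_tokens, generation_tokens, input_price_per_mtok,
                      output_price_per_mtok, avg_watts, hours, price_per_kwh):
    """API cost minus electricity cost; a negative value means the API was cheaper."""
    return (api_cost(prompt_tokens, generation_tokens,
                     input_price_per_mtok, output_price_per_mtok)
            - electricity_cost(avg_watts, hours, price_per_kwh))


if __name__ == "__main__":
    # Hypothetical 30-day window: 40M prompt and 7M generation tokens,
    # prices of 3.00 and 15.00 per million tokens, 100 W average draw, 0.25 per kWh.
    print(self_host_savings(40_000_000, 7_000_000, 3.0, 15.0, 100, 720, 0.25))
```

Hardware purchase cost is deliberately left out here; add it as a one-off term if you want a payback figure rather than a running-cost comparison.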

How to Import

Ensure your compose-monitoring.yaml stack is running and Prometheus is successfully scraping your vLLM and DCGM endpoints.
Open Grafana and navigate to Dashboards > New > Import.
Upload the .json files from the /dashboards folder.
Configure Variables:

Set the GPU Instance and vLLM Instance variables at the top of the dashboard to match your Docker network IPs.
Adjust the Input/Output Cost and Electricity Rate in the Token dashboard to match your local currency and current utility pricing.
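
Before importing, the scrape health mentioned in the first step can be checked from a script instead of the Prometheus UI. A minimal sketch using the Prometheus HTTP API targets endpoint; the base URL assumes Prometheus is published on localhost:9090, which may not match this repository's compose file, and the job names in the sample data are hypothetical.

```python
import json
from urllib.request import urlopen


def unhealthy_targets(targets_response):
    """Return (job, scrape_url, last_error) for every active target that is not up."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (t.get("labels", {}).get("job", "unknown"),
         t.get("scrapeUrl", ""),
         t.get("lastError", ""))
        for t in active
        if t.get("health") != "up"
    ]


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the targets document from a running Prometheus (URL is an assumption)."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Sample response shaped like the targets endpoint output; swap in fetch_targets()
    # to query a live instance.
    sample = {
        "status": "success",
        "data": {"activeTargets": [
            {"labels": {"job": "vllm"}, "scrapeUrl": "http://vllm:8000/metrics",
             "health": "up", "lastError": ""},
            {"labels": {"job": "dcgm-exporter"}, "scrapeUrl": "http://dcgm:9400/metrics",
             "health": "down", "lastError": "connection refused"},
        ]},
    }
    for job, url, err in unhealthy_targets(sample):
        print(f"{job} is not being scraped ({url}): {err}")
```

An empty result means every configured target is up, so the dashboard variables should populate once the JSON is imported.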
