
Kelcode-Dev/vllm-homeops-hub


vLLM Homeops Hub

A collection of high-performance, production-ready LLM deployment stacks for HomeOps environments. This repository focuses on squeezing enterprise-grade throughput out of consumer-grade hardware using vLLM, FP8 quantization, and meticulously tuned speculative decoding.

The "State of Grace"

This isn't just a folder of compose files; it's the result of pushing over 47 million tokens through a single node to find the right balance between hardware weight and assistant reactivity. Like a Triple-Fried Egg, Chilli, and Chutney sandwich, these configurations shouldn't work this well, but they do. This setup represents the "Holly" of local inference: it takes the raw compute of the GB10 Spark and turns it into something that actually feels reactive.

Project Structure

  • compose-monitoring.yaml - The observability backbone of the entire ship; contains the Prometheus, Grafana, and DCGM-Exporter stack required to baseline your VRAM and track speculative hit rates in real time without interfering with the model runtimes
  • compose-qwen3.6-35b.yaml - The current heavy-lifter in the fleet; tuned specifically for Qwen3.6-35B-FP8 with a dflash speculative draft model and a massive 256k context window for deep repository analysis
  • compose-gemma-4-26B-nvfp4.yaml - The MoE (Mixture of Experts) specialist; optimised for the Gemma-4-26B-A4B architecture with Eagle3 speculative decoding enabled
  • prometheus.yaml - Prometheus configuration file that scrapes metrics from the vLLM runtime and dcgm-exporter containers
  • dashboards/ - A curated set of Grafana dashboard JSON exports designed to get your metrics up and running without any manual, "smeg-headed" clicking or configuration drift

Getting Started

  1. Clone the Repo
git clone https://github.com/Kelcode-Dev/vllm-homeops-hub
  2. Create the network
docker network create model-runtimes
  3. Spin up Monitoring
docker compose -f compose-monitoring.yaml up -d
  4. Deploy your Model
docker compose -f compose-qwen3.6-35b-fp8.yaml up -d

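Adjust gpu-memory-utilization and max-model-len to match your specific VRAM headroom, so the ship doesn't do a "Lister-style" crash during heavy ingestion. A lower gpu-memory-utilization leaves less room for the KV cache, which in turn limits how large a max-model-len the engine can serve.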
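
To get a feel for how these two settings interact, here is a rough back-of-the-envelope sketch in Python. It is not taken from this repository: the function names and every number passed in are hypothetical placeholders, and the formula assumes a plain transformer where every layer keeps full keys and values. Models with hybrid, sliding-window, or linear attention layers will need less, and vLLM's own startup log remains the authoritative figure.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache one token occupies across all layers.

    The factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(total_mem_gib: float, gpu_memory_utilization: float,
                      weights_gib: float, overhead_gib: float,
                      per_token_bytes: int) -> int:
    """Estimate how many tokens of KV cache fit in the remaining budget.

    budget = total * utilization - weights - overhead (activations, graphs).
    Returns 0 when the budget is already exhausted.
    """
    budget_gib = total_mem_gib * gpu_memory_utilization - weights_gib - overhead_gib
    budget_bytes = budget_gib * 1024 ** 3
    return max(0, int(budget_bytes // per_token_bytes))


if __name__ == "__main__":
    # All values below are illustrative assumptions, not measured figures.
    per_token = kv_bytes_per_token(num_layers=48, num_kv_heads=8,
                                   head_dim=128, bytes_per_elem=1)
    tokens = max_cached_tokens(total_mem_gib=100, gpu_memory_utilization=0.5,
                               weights_gib=30, overhead_gib=4,
                               per_token_bytes=per_token)
    print(f"{per_token} bytes/token, roughly {tokens} cacheable tokens")
```

If the estimate comes out below your chosen max-model-len, a single full-length request cannot fit in the cache, so either raise gpu-memory-utilization or lower max-model-len.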

Observability & Dashboards

The true power of this stack lies in its visibility. To move beyond "vibes" and into precise tuning, the /dashboards folder contains pre-configured Grafana JSON exports that provide a real-time "State of Grace" view of your hardware and model performance.

1. GPU Service Health (gpu-model-health.json)

A comprehensive "Mission Control" view combining NVIDIA DCGM hardware metrics with vLLM engine performance:

  • Hardware Vitals: Real-time monitoring of GPU Utilisation, SM Clock speeds, Power Draw (Watts), and die/memory temperatures to ensure your hardware isn't running hotter than a supernova
  • Engine Pressure: Tracks the KV Cache usage percentage (critical for preventing OOMs during long-context repository analysis) and active preemption events that signal the engine is struggling
  • Speculative Performance: Dedicated panels for Draft Token Acceptance Rate and average tokens accepted per draft; a sustained drop here suggests your draft model has diverged from the target's distribution
  • Latency Radar: Visualises Time to First Token (TTFT) and Inter-Token Latency (p50/p99) to ensure your "Rule of 4" tuning is actually delivering the snappy, reactive experience it claims to
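
The two speculative-decoding figures above are simple ratios of counter increases over a time window. The Python sketch below shows the arithmetic only; the `SpecDecodeDelta` type and its field names are hypothetical, and the exact counter names exposed by vLLM depend on the version you run, so check your own /metrics output before wiring this to real data.

```python
from dataclasses import dataclass


@dataclass
class SpecDecodeDelta:
    """Counter increases observed over one time window (hypothetical container)."""
    drafts: float           # speculation rounds started in the window
    draft_tokens: float     # tokens proposed by the draft model
    accepted_tokens: float  # proposed tokens the target model accepted


def acceptance_rate(delta: SpecDecodeDelta) -> float:
    """Fraction of proposed draft tokens that were accepted (0.0 to 1.0)."""
    if delta.draft_tokens == 0:
        return 0.0
    return delta.accepted_tokens / delta.draft_tokens


def accepted_per_draft(delta: SpecDecodeDelta) -> float:
    """Average number of draft tokens accepted per speculation round."""
    if delta.drafts == 0:
        return 0.0
    return delta.accepted_tokens / delta.drafts


if __name__ == "__main__":
    # Illustrative numbers only.
    window = SpecDecodeDelta(drafts=100, draft_tokens=400, accepted_tokens=260)
    print(f"acceptance rate: {acceptance_rate(window):.2f}")
    print(f"accepted per draft: {accepted_per_draft(window):.2f}")
```

Working from window deltas rather than lifetime totals is what makes a sustained drop visible: a lifetime average would hide a recent regression behind weeks of healthy history.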

2. LLM Token Usage & Cost Savings (llm-token-usage-and-cost-savings.json)

The "Financial Controller" for your homelab, calculating the ROI of self-hosting vs. commercial APIs:

  • Token Velocity: A cumulative breakdown of prompt (ISL) vs. generation (OSL) tokens over 1, 7, 14, and 30-day windows to track your overall consumption trends
  • Claude 3.5 ROI: Dynamically calculates "Cost Savings" by comparing your local volume against commercial pricing, defaulting to current market rates for Sonnet-tier performance
  • The Electricity Reality: Uses a configurable £/kWh rate to calculate the actual running cost of your GB10 Spark device, providing a blunt "Self-Host vs. API" value comparison that doesn't sugar-coat the bill
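
The comparison this dashboard makes boils down to a few lines of arithmetic. The following Python sketch is an approximation of that logic, not the dashboard's actual queries: the function names are hypothetical, and the prices, power draw, and token counts are placeholder values you would replace with your own. It also assumes API prices and the electricity rate are expressed in the same currency.

```python
def api_cost(prompt_tokens: float, generation_tokens: float,
             input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """What the same token volume would cost on a per-million-token API."""
    return (prompt_tokens / 1e6 * input_price_per_mtok
            + generation_tokens / 1e6 * output_price_per_mtok)


def electricity_cost(avg_power_watts: float, hours: float,
                     rate_per_kwh: float) -> float:
    """Running cost of the device: kW x hours x price per kWh."""
    return avg_power_watts / 1000 * hours * rate_per_kwh


def net_savings(prompt_tokens: float, generation_tokens: float,
                input_price_per_mtok: float, output_price_per_mtok: float,
                avg_power_watts: float, hours: float,
                rate_per_kwh: float) -> float:
    """Positive when self-hosting was cheaper than the API over the window."""
    return (api_cost(prompt_tokens, generation_tokens,
                     input_price_per_mtok, output_price_per_mtok)
            - electricity_cost(avg_power_watts, hours, rate_per_kwh))


if __name__ == "__main__":
    # Placeholder inputs for a 30-day window; none of these are measured values.
    saved = net_savings(prompt_tokens=40_000_000, generation_tokens=7_000_000,
                        input_price_per_mtok=3.0, output_price_per_mtok=15.0,
                        avg_power_watts=100, hours=720, rate_per_kwh=0.25)
    print(f"net savings over the window: {saved:.2f}")
```

Note that this ignores hardware purchase cost, so it measures running-cost savings rather than a full payback period.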

How to Import

  1. Ensure your compose-monitoring.yaml stack is running and Prometheus is successfully scraping your vLLM and DCGM endpoints
  2. Open Grafana and navigate to Dashboards > New > Import
  3. Upload the .json files from the /dashboards folder
  4. Configure Variables:
  • Set the GPU Instance and vLLM Instance variables at the top of the dashboard to match your Docker network IPs
  • Adjust the Input/Output Cost and Electricity Rate in the Token dashboard to match your local currency and current utility pricing
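
Step 1 can be checked from a script rather than by eye. The Python sketch below asks Prometheus for the built-in `up` series through its HTTP query API and lists any scrape target that is not reporting 1. The base URL (localhost on port 9090) and the job names in the sample payload are assumptions; substitute whatever your compose-monitoring.yaml and prometheus.yaml actually define.

```python
import json
import urllib.parse
import urllib.request


def down_targets(query_response: dict) -> list[str]:
    """Return 'job/instance' for every target whose `up` sample is not 1."""
    if query_response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    down = []
    for series in query_response["data"]["result"]:
        _timestamp, value = series["value"]  # instant vector: [ts, "value"]
        if float(value) != 1.0:
            labels = series["metric"]
            down.append(f'{labels.get("job", "?")}/{labels.get("instance", "?")}')
    return down


def fetch_up(base_url: str = "http://localhost:9090") -> dict:
    """Run the instant query `up` against Prometheus (base_url is an assumption)."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": "up"})
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    missing = down_targets(fetch_up())
    if missing:
        print("Not scraping:", ", ".join(missing))
    else:
        print("All targets are up")
```

If a target shows as down, the dashboards will import cleanly but render empty panels, so it is worth fixing the scrape first.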
