feat(monitoring): Grafana + Prometheus + GPU metrics embedded in dashboard #71
Open
AlienWalker1995 wants to merge 3 commits into
Opt-in (--profile monitoring) real-time llama.cpp + GPU performance monitoring,
embedded in the dashboard's new Grafana tab and served via Caddy + SSO at /grafana/.
- llama-server now runs with --metrics (native Prometheus endpoint: token rates,
KV-cache usage, request queue). Internal-only; harmless when unscraped.
- prometheus: scrapes llamacpp:8080/metrics + the GPU exporter (15s, 15d retention).
- gpu-exporter: nvidia_gpu_exporter (wraps nvidia-smi — the right tool for this host's
consumer GPUs on WSL2, where DCGM-exporter does not work). Pinned to an explicit
--query-field-names set: the default AUTO discovery panic-crashes on driver 581.80
(nvidia-smi emits `...sw_thermal_slowdown [us]`, an invalid Prometheus metric name).
Verified live standalone on the 5090 + 1070.
- grafana: anonymous read-only (SSO is the gate), serves from /grafana/ subpath, embedding
enabled; datasource + a llama.cpp/GPU dashboard auto-provisioned. GPU panels use the
live-verified metric names + `group_left(name)` joins for per-GPU legends.
- Caddy /grafana/ route (protected block); dashboard Grafana tab lazy-loads the iframe.
- All images pinned by digest; no host ports; .env.example + CHANGELOG.
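
The crash described in the gpu-exporter item comes down to the Prometheus metric-name grammar, which only allows `[a-zA-Z_:][a-zA-Z0-9_:]*`. The sketch below checks two nvidia-smi field names against that grammar. The `derive_metric_name` helper is hypothetical: it assumes a naive dots-to-underscores mapping purely for illustration and is not the exporter's actual code.

```python
import re

# Metric-name grammar from the Prometheus data model.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def derive_metric_name(field: str) -> str:
    """Hypothetical naive mapping: replace dots, leave everything else."""
    return field.replace(".", "_")


def is_valid_metric_name(name: str) -> bool:
    """True when the whole string matches the metric-name grammar."""
    return METRIC_NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    fields = [
        "utilization.gpu",
        "clocks_event_reasons_counters.sw_thermal_slowdown [us]",
    ]
    for field in fields:
        name = derive_metric_name(field)
        print(f"{name!r}: valid={is_valid_metric_name(name)}")
```

Under that assumption, the second field keeps its space and brackets and fails the check, which is why an explicit field list sidesteps the problem.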
Validated: compose config, promtool check, caddyfile ("Valid configuration"), dashboard
JSON, shell syntax; GPU exporter + metric names exercised live standalone. llama.cpp
panel metric names are the documented set (prompt/predicted_tokens_seconds,
kv_cache_usage_ratio, requests_processing/deferred) — to be confirmed on first live
scrape once llamacpp restarts with --metrics.
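
One way to do the "confirm on first live scrape" step is to compare the expected metric names against the text returned by the `/metrics` endpoint. The following is a minimal sketch, assuming the standard Prometheus text exposition format; the helper names are hypothetical and the sample text is invented, not captured from a real llama-server.

```python
def metric_names(exposition: str) -> set[str]:
    """Collect metric names from Prometheus text-format sample lines."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The name ends at the first '{' (labels) or space (value).
        head = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(head)
    return names


def missing_metrics(exposition: str, expected: list[str]) -> list[str]:
    """Return the expected names that the scrape does not contain."""
    present = metric_names(exposition)
    return [name for name in expected if name not in present]


# Names taken from the commit message above.
EXPECTED = [
    "llamacpp:prompt_tokens_seconds",
    "llamacpp:predicted_tokens_seconds",
    "llamacpp:kv_cache_usage_ratio",
    "llamacpp:requests_processing",
    "llamacpp:requests_deferred",
]

if __name__ == "__main__":
    # Invented sample text for illustration only.
    sample = """\
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 1
llamacpp:requests_deferred 0
"""
    print(missing_metrics(sample, EXPECTED))
```

Any name reported as missing would leave its dashboard panel blank, so it is worth running a check like this before trusting the provisioned dashboard.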
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Live validation against the running stack caught that this llama.cpp build does NOT expose `llamacpp:kv_cache_usage_ratio` (the two KV-cache panels would have been blank). Replaced them with metrics that exist and are confirmed to carry data: a "Decode rate (tok/s, 5m avg)" stat and a "Smoothed throughput (5m rate)" timeseries, both from `rate(llamacpp:tokens_predicted_total)` / `rate(llamacpp:prompt_tokens_total)`.

Now fully live-validated:
- llamacpp recreated with --metrics (Prometheus target up=1)
- all 6 llama.cpp queries + all 4 GPU queries resolve with data
- Grafana provisioned + serving the dashboard
- Caddy /grafana/ → 302 SSO
- both GPUs (5090 + 1070) reporting util/VRAM/temp/power

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
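
Both replacement panels rest on PromQL's `rate()` over a monotonically increasing counter. As rough intuition only, the sketch below computes a per-second average from timestamped counter samples and treats any drop as a counter reset. It is a simplification: the real `rate()` also extrapolates to the window boundaries, and the sample values here are invented.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Average per-second increase over (timestamp, counter_value) samples.

    Simplified: handles counter resets but skips PromQL's extrapolation.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter reset (e.g. the server restarted),
        # so the whole current value counts as new increase.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return increase / elapsed


if __name__ == "__main__":
    # Invented llamacpp:tokens_predicted_total samples, 15 s apart.
    window = [(0.0, 1000.0), (15.0, 1450.0), (30.0, 1900.0)]
    print(simple_rate(window))  # tokens per second over the window
```

Because the result is averaged over the whole window, a 5m window smooths out short bursts, which matches the "5m avg" and "Smoothed throughput" panel titles.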
What
Opt-in (`--profile monitoring`) real-time llama.cpp + GPU performance monitoring, embedded in the dashboard's new 📊 Grafana tab and served through the existing Caddy + Google SSO front door at `/grafana/`.

Architecture
- llama-server: `--metrics`, the native Prometheus endpoint (token rates, KV-cache usage, request queue). Internal-only; harmless when unscraped.
- gpu-exporter: `nvidia_gpu_exporter` (wraps `nvidia-smi`; the right tool for this host's consumer GPUs on WSL2, where DCGM-exporter doesn't work).
- Caddy: `/grafana/` route; dashboard tab lazy-loads the iframe (`?kiosk`).
- `.gitattributes` pins `monitoring/**` to LF.

Real bug caught during validation
The GPU exporter's default field auto-discovery panic-crashes on driver 581.80: `nvidia-smi` emits `clocks_event_reasons_counters.sw_thermal_slowdown [us]`, whose derived metric name has a space + brackets → invalid Prometheus name → crash-loop. Fixed by pinning `--query-field-names` to an explicit safe set. Verified live standalone on the 5090 + 1070.

Validation
- `promtool check config` (SUCCESS), `caddy validate` ("Valid configuration"), dashboard JSON, shell syntax.
- GPU panels use the live-verified metric names (`nvidia_smi_utilization_gpu_ratio`, `..._memory_used_bytes`, `..._temperature_gpu`, `..._power_draw_watts`) with `group_left(name)` joins (GPU name is only on `nvidia_smi_gpu_info`, not the value metrics).
- llama.cpp panel metric names are the documented set (`llamacpp:predicted_tokens_seconds`, `prompt_tokens_seconds`, `kv_cache_usage_ratio`, `requests_processing`, `requests_deferred`); not yet live-verified because that needs llamacpp to restart with `--metrics`. Confirm on first live scrape.

Enable (operator; needs SOPS secrets, so runs on the host)
Then open the dashboard → Grafana tab, or `https://<host>/grafana/`.

🤖 Generated with Claude Code