feat(monitoring): Grafana + Prometheus + GPU metrics embedded in dashboard #71

Open
AlienWalker1995 wants to merge 3 commits into main from feat/grafana-monitoring

Conversation

@AlienWalker1995

Owner

What

Opt-in (--profile monitoring) real-time llama.cpp + GPU performance monitoring, embedded in the dashboard's new 📊 Grafana tab and served through the existing Caddy + Google SSO front door at /grafana/.

Architecture

llama-server --metrics ─┐
gpu-exporter (nvidia-smi)┤─► Prometheus ─► Grafana ─► Caddy /grafana/ (SSO) ─► dashboard iframe
  • llama-server --metrics — native Prometheus endpoint (token rates, KV-cache usage, request queue). Internal-only; harmless when unscraped.
  • prometheus — scrapes llamacpp + the GPU exporter (15s, 15d retention).
  • gpu-exporter: nvidia_gpu_exporter (wraps nvidia-smi; the right tool for this host's consumer GPUs on WSL2, where DCGM-exporter doesn't work).
  • grafana — anonymous read-only (SSO is the gate), subpath serving + embedding; datasource + a llama.cpp/GPU dashboard auto-provisioned.
  • Caddy /grafana/ route; dashboard tab lazy-loads the iframe (?kiosk).
  • All images pinned by digest; no host ports; .gitattributes pins monitoring/** to LF.

Real bug caught during validation

The GPU exporter's default field auto-discovery panic-crashes on driver 581.80: nvidia-smi emits clocks_event_reasons_counters.sw_thermal_slowdown [us], whose derived metric name has a space + brackets → invalid Prometheus name → crash-loop. Fixed by pinning --query-field-names to an explicit safe set. Verified live standalone on the 5090 + 1070.
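
To illustrate why that field breaks, here is a minimal Python sketch that checks names against the classic Prometheus metric-name grammar (`[a-zA-Z_:][a-zA-Z0-9_:]*`). The `naive_metric_name` helper is hypothetical: it assumes a derivation that only prefixes the field and swaps dots for underscores, which may not match what the exporter actually does.

```python
import re

# Classic Prometheus metric-name grammar (data model documentation).
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def is_valid_metric_name(name: str) -> bool:
    """Return True if the whole string matches the metric-name grammar."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def naive_metric_name(field: str, prefix: str = "nvidia_smi") -> str:
    """Hypothetical derivation: prefix the nvidia-smi field, replace dots only.

    This is an assumption for illustration, not the exporter's real code.
    """
    return f"{prefix}_{field.replace('.', '_')}"


# Field name quoted in the PR description (driver 581.80).
bad_field = "clocks_event_reasons_counters.sw_thermal_slowdown [us]"
# An ordinary nvidia-smi query field, for comparison.
good_field = "temperature.gpu"

for field in (bad_field, good_field):
    name = naive_metric_name(field)
    print(f"{name!r}: valid={is_valid_metric_name(name)}")
```

The space and square brackets survive the naive derivation, so the first name is rejected, while the second passes. Restricting the exporter to an explicit field list sidesteps the problem regardless of how the name is derived.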

Validation

  • ✅ compose config, promtool check config (SUCCESS), caddy validate ("Valid configuration"), dashboard JSON, shell syntax.
  • GPU exporter + metric names exercised live standalone — dashboard GPU panels use the real names (nvidia_smi_utilization_gpu_ratio, ..._memory_used_bytes, ..._temperature_gpu, ..._power_draw_watts) with group_left(name) joins (GPU name is only on nvidia_smi_gpu_info, not the value metrics).
  • ⚠️ llama.cpp panel metric names are the documented set (llamacpp:predicted_tokens_seconds, prompt_tokens_seconds, kv_cache_usage_ratio, requests_processing, requests_deferred) — not yet live-verified because that needs llamacpp to restart with --metrics. Confirm on first live scrape.
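
One way to do the "confirm on first live scrape" step is to compare the names a panel expects against what the endpoint actually exposes. The sketch below parses Prometheus text exposition from a string; `SAMPLE_EXPOSITION` is made-up illustrative data, not output captured from this stack, and the helper names are hypothetical. It also assumes every expected name carries the `llamacpp:` prefix. In practice the text would come from the llamacpp `/metrics` endpoint.

```python
from typing import Iterable, List, Set


def exposed_metric_names(exposition: str) -> Set[str]:
    """Collect metric names from Prometheus text-format exposition."""
    names: Set[str] = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The name ends at the first '{' (labels) or at whitespace (value).
        names.add(line.split("{", 1)[0].split()[0])
    return names


def missing_metrics(expected: Iterable[str], exposition: str) -> List[str]:
    """Return the expected names that the endpoint does not expose."""
    present = exposed_metric_names(exposition)
    return sorted(name for name in expected if name not in present)


# Names the dashboard panels rely on, per the PR description.
EXPECTED = [
    "llamacpp:predicted_tokens_seconds",
    "llamacpp:prompt_tokens_seconds",
    "llamacpp:kv_cache_usage_ratio",
    "llamacpp:requests_processing",
    "llamacpp:requests_deferred",
]

# Illustrative sample only (assumption): values and HELP text are invented.
SAMPLE_EXPOSITION = """\
# HELP llamacpp:predicted_tokens_seconds Average generation throughput.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 42.5
llamacpp:prompt_tokens_seconds 310.0
llamacpp:requests_processing 1
llamacpp:requests_deferred 0
"""

print(missing_metrics(EXPECTED, SAMPLE_EXPOSITION))
```

Any name the function reports as missing corresponds to a panel that would render blank, so the check is cheap to run right after the first scrape.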

Enable (operator — needs SOPS secrets, so runs on the host)

make up   # (decrypts secrets), then:
docker compose --env-file .env --env-file ~/.ai-toolkit/runtime/.env --profile monitoring up -d --build
docker compose ... up -d --force-recreate llamacpp   # applies --metrics (model reload)

Then open the dashboard → Grafana tab, or https://<host>/grafana/.

🤖 Generated with Claude Code

Hermes Bot and others added 3 commits July 4, 2026 21:18
…hboard

Opt-in (--profile monitoring) real-time llama.cpp + GPU performance monitoring,
embedded in the dashboard's new Grafana tab and served via Caddy + SSO at /grafana/.

- llama-server now runs with --metrics (native Prometheus endpoint: token rates,
  KV-cache usage, request queue). Internal-only; harmless when unscraped.
- prometheus: scrapes llamacpp:8080/metrics + the GPU exporter (15s, 15d retention).
- gpu-exporter: nvidia_gpu_exporter (wraps nvidia-smi — the right tool for this host's
  consumer GPUs on WSL2, where DCGM-exporter does not work). Pinned to an explicit
  --query-field-names set: the default AUTO discovery panic-crashes on driver 581.80
  (nvidia-smi emits `...sw_thermal_slowdown [us]`, an invalid Prometheus metric name).
  Verified live standalone on the 5090 + 1070.
- grafana: anonymous read-only (SSO is the gate), serves from /grafana/ subpath, embedding
  enabled; datasource + a llama.cpp/GPU dashboard auto-provisioned. GPU panels use the
  live-verified metric names + `group_left(name)` joins for per-GPU legends.
- Caddy /grafana/ route (protected block); dashboard Grafana tab lazy-loads the iframe.
- All images pinned by digest; no host ports; .env.example + CHANGELOG.
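
The `group_left(name)` join mentioned above can be pictured as a many-to-one lookup: each value series picks up the `name` label from the matching `nvidia_smi_gpu_info` series. The Python sketch below mimics that behaviour on plain dictionaries. It assumes the series are matched on a `uuid` label, and the UUIDs, GPU names, and readings are invented for illustration.

```python
from typing import Dict, List, Tuple

Labels = Dict[str, str]
Series = Tuple[Labels, float]  # (label set, sample value)


def join_group_left(
    values: List[Series],
    info: List[Series],
    on: str = "uuid",      # assumed matching label
    copy: str = "name",    # label copied from the info series
) -> List[Series]:
    """Mimic `values * on(uuid) group_left(name) info` for instant vectors."""
    info_by_key = {labels[on]: (labels, value) for labels, value in info}
    joined: List[Series] = []
    for labels, value in values:
        match = info_by_key.get(labels.get(on, ""))
        if match is None:
            continue  # no matching info series: the sample is dropped
        info_labels, info_value = match
        # Info metrics carry the value 1, so the product keeps the reading.
        joined.append(({**labels, copy: info_labels[copy]}, value * info_value))
    return joined


# Illustrative data only (assumption): UUIDs, names, and readings are invented.
gpu_info: List[Series] = [
    ({"uuid": "GPU-aaaa", "name": "RTX 5090"}, 1.0),
    ({"uuid": "GPU-bbbb", "name": "GTX 1070"}, 1.0),
]
temperature: List[Series] = [
    ({"uuid": "GPU-aaaa"}, 61.0),
    ({"uuid": "GPU-bbbb"}, 48.0),
]

for labels, value in join_group_left(temperature, gpu_info):
    print(labels["name"], value)
```

Because the info series always has the value 1, multiplying by it leaves the reading unchanged and only enriches the label set, which is what makes per-GPU legends possible.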

Validated: compose config, promtool check, caddyfile ("Valid configuration"), dashboard
JSON, shell syntax; GPU exporter + metric names exercised live standalone. llama.cpp
panel metric names are the documented set (prompt/predicted_tokens_seconds,
kv_cache_usage_ratio, requests_processing/deferred) — to be confirmed on first live
scrape once llamacpp restarts with --metrics.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Live validation against the running stack caught that this llama.cpp build does
NOT expose `llamacpp:kv_cache_usage_ratio` (the two KV-cache panels would have
been blank). Replaced them with metrics that exist and are confirmed to carry
data: a "Decode rate (tok/s, 5m avg)" stat and a "Smoothed throughput (5m rate)"
timeseries, both from `rate(llamacpp:tokens_predicted_total)` /
`rate(llamacpp:prompt_tokens_total)`.
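
For readers less familiar with `rate()` over a counter, the Python sketch below shows the core idea: total increase across the window divided by elapsed seconds, with a counter drop treated as a reset. It is a simplification and deliberately omits the boundary extrapolation that Prometheus applies; the function name and sample values are hypothetical.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix_seconds, counter_value)


def simple_rate(samples: Sequence[Sample]) -> float:
    """Per-second increase of a counter over the given window (simplified)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")

    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the counter reset (for example, llamacpp restarted),
        # so the new value counts as growth from zero.
        increase += value if value < previous else value - previous
        previous = value
    return increase / elapsed


# Illustrative token counter scraped every 15 s (values are invented).
window = [(0.0, 100.0), (15.0, 400.0), (30.0, 700.0)]
print(simple_rate(window))  # tokens per second over the window
```

Averaging over a 5m window in this way smooths out bursts between scrapes, which is why the replacement panels read as steadier throughput figures.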

Now fully live-validated: llamacpp recreated with --metrics (Prometheus target
up=1); all 6 llama.cpp queries + all 4 GPU queries resolve with data; Grafana
provisioned + serving the dashboard; Caddy /grafana/ → 302 SSO; both GPUs
(5090 + 1070) reporting util/VRAM/temp/power.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>