feat(monitoring): Grafana + Prometheus + GPU metrics embedded in dashboard by AlienWalker1995 · Pull Request #71 · AlienWalker1995/Ordo-AI-Stack

AlienWalker1995 · 2026-07-05T01:19:23Z

What

Opt-in (--profile monitoring) real-time llama.cpp + GPU performance monitoring, embedded in the dashboard's new 📊 Grafana tab and served through the existing Caddy + Google SSO front door at /grafana/.

Architecture

llama-server --metrics    ─┐
gpu-exporter (nvidia-smi) ─┴─► Prometheus ─► Grafana ─► Caddy /grafana/ (SSO) ─► dashboard iframe

llama-server --metrics — native Prometheus endpoint (token rates, KV-cache usage, request queue). Internal-only; harmless when unscraped.
prometheus — scrapes llamacpp + the GPU exporter (15s, 15d retention).
gpu-exporter — nvidia_gpu_exporter (wraps nvidia-smi; the right tool for this host's consumer GPUs on WSL2, where DCGM-exporter doesn't work).
grafana — anonymous read-only (SSO is the gate), subpath serving + embedding; datasource + a llama.cpp/GPU dashboard auto-provisioned.
Caddy /grafana/ route; dashboard tab lazy-loads the iframe (?kiosk).
All images pinned by digest; no host ports; .gitattributes pins monitoring/** to LF.
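
For a rough sense of what the 15s scrape interval and 15d retention imply, the arithmetic can be sketched as follows. This is only a back-of-the-envelope count of samples per time series; the function name is made up for illustration, and it says nothing about actual disk usage, which depends on series count and compression.

```python
# Back-of-the-envelope sample count for the Prometheus settings in this PR.
# Only the two constants come from the PR text; the helper is illustrative.
SCRAPE_INTERVAL_S = 15   # scrape interval stated in the PR (15s)
RETENTION_DAYS = 15      # retention stated in the PR (15d)


def samples_per_series(interval_s: int, retention_days: int) -> int:
    """Samples a single time series accumulates over the retention window."""
    seconds_retained = retention_days * 24 * 60 * 60
    return seconds_retained // interval_s


if __name__ == "__main__":
    print(samples_per_series(SCRAPE_INTERVAL_S, RETENTION_DAYS))
```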

Real bug caught during validation

The GPU exporter's default field auto-discovery panic-crashes on driver 581.80: nvidia-smi emits clocks_event_reasons_counters.sw_thermal_slowdown [us], whose derived metric name has a space + brackets → invalid Prometheus name → crash-loop. Fixed by pinning --query-field-names to an explicit safe set. Verified live standalone on the 5090 + 1070.
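
To illustrate why that field breaks, here is a minimal sketch that checks a derived name against the classic Prometheus metric-name pattern `[a-zA-Z_:][a-zA-Z0-9_:]*`. The `naive_metric_name` helper is a hypothetical stand-in for whatever derivation the exporter performs (its real logic is not shown in this PR); it only replaces dots with underscores, which is enough to show the space and brackets surviving into the name.

```python
import re

# Classic Prometheus metric-name grammar: first character is a letter,
# underscore, or colon; the rest may also include digits.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def is_valid_metric_name(name: str) -> bool:
    """True if the whole string matches the classic metric-name pattern."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def naive_metric_name(field: str, prefix: str = "nvidia_smi") -> str:
    """HYPOTHETICAL derivation (assumption): only dots become underscores."""
    return f"{prefix}_{field.replace('.', '_')}"


# Field name quoted in the PR description.
bad_field = "clocks_event_reasons_counters.sw_thermal_slowdown [us]"
# A field with no unit suffix, for contrast.
good_field = "temperature.gpu"

if __name__ == "__main__":
    for field in (bad_field, good_field):
        name = naive_metric_name(field)
        print(repr(name), is_valid_metric_name(name))
```

An explicit --query-field-names list sidesteps the problem because only fields known to map to valid names are ever queried.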

Validation

✅ compose config, promtool check config (SUCCESS), caddy validate ("Valid configuration"), dashboard JSON, shell syntax.
✅ GPU exporter + metric names exercised live standalone — dashboard GPU panels use the real names (nvidia_smi_utilization_gpu_ratio, ..._memory_used_bytes, ..._temperature_gpu, ..._power_draw_watts) with group_left(name) joins (GPU name is only on nvidia_smi_gpu_info, not the value metrics).
⚠️ llama.cpp panel metric names are the documented set (llamacpp:predicted_tokens_seconds, prompt_tokens_seconds, kv_cache_usage_ratio, requests_processing, requests_deferred) — not yet live-verified because that needs llamacpp to restart with --metrics. Confirm on first live scrape.
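
The group_left(name) join mentioned for the GPU panels can be pictured with a small emulation. This is a sketch, not the dashboard's actual query: it assumes the value metrics and nvidia_smi_gpu_info share a `uuid` label (an assumption about the exporter's label set), and the UUIDs, GPU names, and readings below are invented sample data.

```python
from typing import Dict, List

Sample = Dict[str, object]  # {"labels": {label: value}, "value": float}


def join_group_left_name(values: List[Sample], info: List[Sample],
                         on: str = "uuid") -> List[Sample]:
    """Emulate `values * on(uuid) group_left(name) info` for an info metric.

    Each value sample keeps its own labels and gains the `name` label from
    the info sample with the same `on` label. The info metric's value is 1,
    so the product leaves the reading unchanged.
    """
    info_by_key = {s["labels"][on]: s for s in info}
    joined = []
    for s in values:
        match = info_by_key.get(s["labels"].get(on))
        if match is None:
            continue  # no info series for this GPU: PromQL drops the sample
        labels = dict(s["labels"], name=match["labels"]["name"])
        joined.append({"labels": labels, "value": s["value"] * match["value"]})
    return joined


# Invented sample data (hypothetical UUIDs, names, and temperatures).
gpu_info = [
    {"labels": {"uuid": "GPU-aaaa", "name": "gpu-5090"}, "value": 1.0},
    {"labels": {"uuid": "GPU-bbbb", "name": "gpu-1070"}, "value": 1.0},
]
temperature = [
    {"labels": {"uuid": "GPU-aaaa"}, "value": 61.0},
    {"labels": {"uuid": "GPU-bbbb"}, "value": 48.0},
]

if __name__ == "__main__":
    for row in join_group_left_name(temperature, gpu_info):
        print(row["labels"]["name"], row["value"])
```

The join is what lets a panel legend show a readable GPU name even though the value series themselves carry only an identifier.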

Enable (operator — needs SOPS secrets, so runs on the host)

make up   # (decrypts secrets), then:
docker compose --env-file .env --env-file ~/.ai-toolkit/runtime/.env --profile monitoring up -d --build
docker compose ... up -d --force-recreate llamacpp   # applies --metrics (model reload)

Then open the dashboard → Grafana tab, or https://<host>/grafana/.

🤖 Generated with Claude Code

…hboard

Opt-in (--profile monitoring) real-time llama.cpp + GPU performance monitoring, embedded in the dashboard's new Grafana tab and served via Caddy + SSO at /grafana/.

- llama-server now runs with --metrics (native Prometheus endpoint: token rates, KV-cache usage, request queue). Internal-only; harmless when unscraped.
- prometheus: scrapes llamacpp:8080/metrics + the GPU exporter (15s, 15d retention).
- gpu-exporter: nvidia_gpu_exporter (wraps nvidia-smi; the right tool for this host's consumer GPUs on WSL2, where DCGM-exporter does not work). Pinned to an explicit --query-field-names set: the default AUTO discovery panic-crashes on driver 581.80 (nvidia-smi emits `...sw_thermal_slowdown [us]`, an invalid Prometheus metric name). Verified live standalone on the 5090 + 1070.
- grafana: anonymous read-only (SSO is the gate), serves from /grafana/ subpath, embedding enabled; datasource + a llama.cpp/GPU dashboard auto-provisioned. GPU panels use the live-verified metric names + `group_left(name)` joins for per-GPU legends.
- Caddy /grafana/ route (protected block); dashboard Grafana tab lazy-loads the iframe.
- All images pinned by digest; no host ports; .env.example + CHANGELOG.

Validated: compose config, promtool check, caddyfile ("Valid configuration"), dashboard JSON, shell syntax; GPU exporter + metric names exercised live standalone. llama.cpp panel metric names are the documented set (prompt/predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing/deferred), to be confirmed on first live scrape once llamacpp restarts with --metrics.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


Live validation against the running stack caught that this llama.cpp build does NOT expose `llamacpp:kv_cache_usage_ratio` (the two KV-cache panels would have been blank). Replaced them with metrics that exist and are confirmed to carry data: a "Decode rate (tok/s, 5m avg)" stat and a "Smoothed throughput (5m rate)" timeseries, both from `rate(llamacpp:tokens_predicted_total)` / `rate(llamacpp:prompt_tokens_total)`.

Now fully live-validated: llamacpp recreated with --metrics (Prometheus target up=1); all 6 llama.cpp queries + all 4 GPU queries resolve with data; Grafana provisioned + serving the dashboard; Caddy /grafana/ → 302 SSO; both GPUs (5090 + 1070) reporting util/VRAM/temp/power.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
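
The kind of check that surfaces a missing metric like this can be sketched as a comparison between the names a dashboard expects and the names present in a scrape. The snippet below is an illustration only: the helper names are invented, the parser handles just the simple `name{labels} value` and `name value` line shapes of the Prometheus text format, and the sample scrape text is made up rather than captured from the running stack.

```python
from typing import Iterable, List, Set


def exposed_metric_names(exposition_text: str) -> Set[str]:
    """Collect metric names from Prometheus text-format sample lines."""
    names = set()
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and # HELP / # TYPE comments
        # The name is everything before the label block or the first space.
        names.add(line.split("{", 1)[0].split()[0])
    return names


def missing_metrics(expected: Iterable[str], exposition_text: str) -> List[str]:
    """Expected names that do not appear in the scrape, sorted."""
    return sorted(set(expected) - exposed_metric_names(exposition_text))


# Names referenced by the panels, taken from the PR text.
EXPECTED = [
    "llamacpp:kv_cache_usage_ratio",
    "llamacpp:prompt_tokens_total",
    "llamacpp:tokens_predicted_total",
    "llamacpp:requests_processing",
]

# Invented scrape body (hypothetical values) that lacks the KV-cache gauge.
SAMPLE_SCRAPE = """\
# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 1200
llamacpp:tokens_predicted_total 3400
llamacpp:requests_processing 1
"""

if __name__ == "__main__":
    print(missing_metrics(EXPECTED, SAMPLE_SCRAPE))
```

Running a comparison like this against the first live scrape would flag any panel query that is about to render blank.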

Hermes Bot and others added 3 commits July 4, 2026 21:18

chore: pin monitoring/** to LF for bind-mounted configs

5a41e5f

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

AlienWalker1995 wants to merge 3 commits into main from feat/grafana-monitoring
