Skip to content

feat: add Prometheus metrics, fix GPU resource rewriting, and harden deployment#10

Merged
shenxianpeng merged 6 commits intomainfrom
feature/my-new-feature
Mar 24, 2026
Merged

feat: add Prometheus metrics, fix GPU resource rewriting, and harden deployment#10
shenxianpeng merged 6 commits intomainfrom
feature/my-new-feature

Conversation

@shenxianpeng
Copy link
Copy Markdown
Member

Summary

  • Prometheus metrics: Add castslice_requests_total, castslice_mutations_total, castslice_noop_total, castslice_errors_total counters scraped via :8080/metrics
  • Grafana FinOps dashboard: 5-panel ConfigMap with request rate, mutation rate, no-op rate, error rate, and mutation efficiency ratio (with division-by-zero guard)
  • Critical bug fix: GPU resource rewriting now mirrors nvidia.com/gpu → nvidia.com/gpu-shared in both Limits and Requests — previously Requests was not updated, causing pod scheduling failures (Kubernetes device plugin contract violation)
  • Bug fix: Proportional scaling corrected — 2 GPUs × ratio 4 = 8 shared (was previously always = ratio, ignoring original GPU count)
  • Deployment hardening: replicas 1→2, liveness probe, PodDisruptionBudget (minAvailable: 1)
  • slice-ratio upper bound: Values > 64 now rejected with a clear error message
  • Safe metric registration: mustRegister() helper replaces promauto to prevent panic on duplicate registration (go test -count=2)
  • Production logger: Development: false in zap options

Test Coverage

CODE PATH COVERAGE
===========================
[+] internal/webhook/pod_webhook.go
    ├── resolveSliceRatio()
    │   ├── [★★★ TESTED] slice-ratio valid — TestHandle_SliceRatioAnnotation_ExplicitOverride
    │   ├── [★★★ TESTED] slice-ratio zero/negative/non-integer — TestHandle_InvalidSliceRatio
    │   ├── [★★★ TESTED] slice-ratio > 64 rejected — TestHandle_SliceRatioOverLimit (NEW)
    │   ├── [★★★ TESTED] workload-type training/inference → ratio — TestHandle_WorkloadType_*
    │   └── [★★★ TESTED] workload-type unknown rejected — TestHandle_InvalidWorkloadType
    ├── rewriteGPUResources()
    │   ├── [★★★ TESTED] Requests mirrored when set — TestHandle_RequestsMirrored (NEW)
    │   └── [★★★ TESTED] Proportional scaling 2 GPUs × ratio 4 = 8 — TestHandle_ProportionalScaling (NEW)
    └── Handle()
        ├── [★★★ TESTED] No annotation → no-op
        ├── [★★★ TESTED] Init containers
        └── [★★  TESTED] Multiple containers

Tests: 11 → 14 (+3 new)

Pre-Landing Review

Full /review pass completed (structured + adversarial, large-tier, 4 passes including Codex). 8 issues fixed, 0 remaining. Logged status: clean.

Design Review

No frontend UI files changed — design review skipped.

Eval Results

No prompt-related files changed — evals skipped.

TODOS

No TODO items completed in this PR.

Test plan

  • All Go tests pass: 14 tests, 0 failures (go test ./internal/webhook/... -count=1)
  • Build clean: go build ./...
  • Metrics registration safe: go test -count=2 (no panic on duplicate registration)

🤖 Generated with Claude Code

shenxianpeng and others added 5 commits March 24, 2026 23:55
- Add castslice_requests_total, mutations_total, noop_total, errors_total
  counters registered against controller-runtime metrics registry
- Use mustRegister() helper (AlreadyRegisteredError) instead of promauto
  to prevent panic on duplicate registration (e.g. go test -count=2)
- Fix rewriteGPUResources to mirror Limits change into Requests
  (required by Kubernetes device plugin contract for extended resources)
- Fix proportional scaling: shared = originalGPUCount × sliceRatio
- Add maxSliceRatio=64 upper-bound validation on slice-ratio annotation
- Set zap.Options{Development: false} for production logging
- Increase replicas from 1 to 2 for high availability
- Add liveness probe (httpGet /healthz :8081) to detect deadlocked replicas
- Add PodDisruptionBudget (minAvailable: 1) to prevent total webhook outage
  during voluntary node drains
- Add cast-slice-metrics Service exposing :8080 with Prometheus scrape annotations
- Add Grafana FinOps dashboard ConfigMap with 5 panels (request rate, mutation rate,
  no-op rate, error rate, mutation efficiency ratio with division-by-zero guard)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds operational observability (Prometheus + Grafana) to the CastSlice mutating webhook, fixes GPU resource rewrite behavior to satisfy Kubernetes extended-resource rules, and hardens the in-cluster deployment for higher availability.

Changes:

  • Add Prometheus counters (requests/mutations/noop/errors) registered in the controller-runtime metrics registry and scraped via :8080/metrics.
  • Fix GPU rewrite logic to (a) scale gpu-shared proportionally by original GPU count and (b) mirror rewrites into Requests when explicitly set.
  • Harden deployment (replicas 2, liveness probe, metrics Service, PodDisruptionBudget) and update docs/versioning.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
main.go Switch zap to production-mode logging (Development: false).
internal/webhook/pod_webhook.go Add slice-ratio upper bound; rewrite GPU logic + increment metrics counters.
internal/webhook/metrics.go Define/register Prometheus counters with safe registration helper.
internal/webhook/pod_webhook_test.go Add/extend tests for metrics deltas and GPU rewrite correctness.
config/deploy/deployment.yaml Increase replicas, add liveness probe, add metrics Service, add PDB.
config/monitoring/grafana-dashboard.yaml Add Grafana ConfigMap dashboard for the new counters.
go.mod Add prometheus/client_golang as a direct dependency.
README.md Document metrics/dashboard and update project structure/roadmap.
CHANGELOG.md Introduce changelog for v0.2.1.0.
VERSION Add version file (0.2.1.0).
TODOS.md Collapse TODO list to “no open items”.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +24 to +26
- Added liveness probe to Deployment to detect deadlocked webhook replicas
- Grafana mutation efficiency panel division-by-zero when idle — fixed with `or vector(0)`
- `actions/checkout` pinned to `v4` (was `v6`, which does not exist)
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changelog entry claims actions/checkout was pinned to v4, but the current workflows still reference actions/checkout@v6 (e.g. .github/workflows/ci.yaml and release.yaml). Either update those workflows in this PR or remove/adjust this changelog line so it matches what was actually changed.

Copilot uses AI. Check for mistakes.
- Multi-GPU containers received wrong shared count (e.g., 2 GPUs with ratio 4 now yields 8 shared, not 4)
- `promauto` replaced with safe `mustRegister` helper to prevent panic on duplicate metric registration during `go test -count=2`
- Added liveness probe to Deployment to detect deadlocked webhook replicas
- Grafana mutation efficiency panel division-by-zero when idle — fixed with `or vector(0)`
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changelog says the mutation-efficiency panel avoids division-by-zero with or vector(0), but the current dashboard query will still produce NaN when both rates are 0 (since or doesn’t replace an existing NaN series). After adjusting the dashboard query (e.g., using clamp_min on the denominator), update this line to accurately describe the fix.

Suggested change
- Grafana mutation efficiency panel division-by-zero when idle — fixed with `or vector(0)`
- Grafana mutation efficiency panel division-by-zero/NaN when idle — fixed by clamping the denominator in the dashboard query (e.g., using `clamp_min`)

Copilot uses AI. Check for mistakes.
Comment on lines +81 to 92
// rewriteGPUResources replaces nvidia.com/gpu with nvidia.com/gpu-shared in both
// Limits and Requests of the given ResourceRequirements. The shared-GPU quantity
// equals the original GPU count multiplied by sliceRatio, preserving per-container
// resource proportions. Returns true if the resources were modified.
func rewriteGPUResources(resources *corev1.ResourceRequirements, sliceRatio int64) bool {
if resources == nil {
return false
}
if _, ok := limits[gpuResource]; !ok {
gpuQty, ok := resources.Limits[gpuResource]
if !ok {
return false
}
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc comment says this rewrites GPU resources in both Limits and Requests, but the function only triggers when nvidia.com/gpu is present in resources.Limits. If a Pod specifies the GPU resource only in Requests (or if Limits is omitted for other reasons), it will be treated as a no-op. Either clarify the comment to reflect the actual behavior (Limits-driven rewrite that optionally mirrors Requests), or extend the logic to also handle a Requests-only GPU specification.

Copilot uses AI. Check for mistakes.
Comment on lines 48 to 53
readinessProbe:
httpGet:
path: /healthz
port: 8081
initialDelaySeconds: 5
periodSeconds: 10
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The readinessProbe is hitting /healthz, but main.go registers a distinct ready check via mgr.AddReadyzCheck(...) (served at /readyz). Using /healthz for readiness means the readyz check is never exercised and can report the pod ready even if readiness semantics change later. Recommend switching the readinessProbe path to /readyz (and keep /healthz for liveness).

Copilot uses AI. Check for mistakes.
"gridPos": { "x": 0, "y": 16, "w": 24, "h": 8 },
"targets": [
{
"expr": "(rate(castslice_mutations_total[5m]) / rate(castslice_requests_total[5m])) or vector(0)",
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PromQL division-by-zero guard here won’t work when both rates are 0: 0/0 yields NaN, and ... or vector(0) does not override an existing time series with a NaN value. This panel will still show NaN when idle. Consider using rate(mutations_total[5m]) / clamp_min(rate(requests_total[5m]), 1) (or a small epsilon) so the ratio evaluates to 0 when there are no requests.

Suggested change
"expr": "(rate(castslice_mutations_total[5m]) / rate(castslice_requests_total[5m])) or vector(0)",
"expr": "rate(castslice_mutations_total[5m]) / clamp_min(rate(castslice_requests_total[5m]), 1)",

Copilot uses AI. Check for mistakes.
- [x] **v0.1.0**: Basic Mutating Webhook (Static Slicing).
- [x] **v0.2.0**: Smart Slicing (Dynamic ratios based on workload type).
- [ ] **v0.3.0**: FinOps Dashboard (Live GPU utilization metrics).
- [x] **v0.3.0**: FinOps Dashboard (Live GPU utilization metrics).
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The roadmap marks v0.3.0 as shipped, but this PR introduces VERSION/changelog content for 0.2.1.0. This is a documentation/versioning inconsistency for readers. Either bump the release/version references to 0.3.0, or keep the roadmap item unchecked / renumber it to match the current release versioning scheme.

Suggested change
- [x] **v0.3.0**: FinOps Dashboard (Live GPU utilization metrics).
- [ ] **v0.3.0**: FinOps Dashboard (Live GPU utilization metrics).

Copilot uses AI. Check for mistakes.
@shenxianpeng shenxianpeng changed the title feat: add Prometheus metrics, fix GPU resource rewriting, and harden deployment (v0.2.1.0) feat: add Prometheus metrics, fix GPU resource rewriting, and harden deployment (v0.3.0) Mar 24, 2026
@shenxianpeng shenxianpeng merged commit 2bb6d4e into main Mar 24, 2026
1 check passed
@shenxianpeng shenxianpeng deleted the feature/my-new-feature branch March 24, 2026 22:13
@shenxianpeng shenxianpeng added major-rfe Major enhancement. Will be highlighted on the top minor A minor version bump and removed major-rfe Major enhancement. Will be highlighted on the top labels Mar 24, 2026
@shenxianpeng shenxianpeng changed the title feat: add Prometheus metrics, fix GPU resource rewriting, and harden deployment (v0.3.0) feat: add Prometheus metrics, fix GPU resource rewriting, and harden deployment Mar 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request minor A minor version bump

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants