
chore(recipes): make DRA-rollout trigger durable (don't rely on manual annotation bump) #973

Description

@yuanchen8911

Summary

PR #965 mitigates the stale-NVML class of bug by hard-coding the gpu-operator chart version into a DRA pod-template annotation. The annotation works as long as it's kept in lockstep with the gpu-operator chart pin — which depends on a maintainer remembering to bump both in the same PR.

That's brittle. A future gpu-operator bump that forgets the annotation will silently re-introduce the original failure (stale NVML, DRA FailedPrepareDynamicResources, GPU workloads stuck in ContainerCreating).

This issue tracks the durable fix.

Pre-PR-#965 repro (context)

After upgrading gpu-operator from v25.10.1 (driver 580.105.08) → v26.3.1 (driver 580.126.20), the nvidia-dra-driver-gpu kubelet-plugin DaemonSet pod is not restarted by helm upgrade. Its in-pod NVML handle remains bound to the old driver, and DRA allocation fails with:

FailedPrepareDynamicResources: Failed to prepare dynamic resources:
  NodePrepareResources failed for ResourceClaim ...:
  unable to create CDI spec file for claim: ...:
  failed to initialize NVML: Driver/library version mismatch

Workloads that request GPUs via DRA (Dynamo DynamoGraphDeployment, aicr validate --phase performance inference-perf, etc.) stay in ContainerCreating indefinitely. Manual workaround pre-#965: kubectl delete pod -n nvidia-dra-driver -l app.kubernetes.io/name=nvidia-dra-driver-gpu.

What's already in place (don't reinvent)

deploy.sh.tmpl (pkg/bundler/deployer/helm/templates/deploy.sh.tmpl:354) already emits a post-install kubectl rollout restart for the DRA kubelet-plugin DaemonSet, gated on name == "nvidia-dra-driver-gpu". So:

  • ✅ Users running the bundled ./deploy.sh (default Helm deployer) get the restart automatically.
  • ❌ Users consuming helmfile, flux, argocd, or argocd-helm artifacts get no restart: those deployer templates have no equivalent hook (grep -rn "kubelet-plugin" pkg/bundler/deployer/{helmfile,flux,argocd*}/ returns zero matches).
  • ❌ Users running helm upgrade directly on the upstream chart without going through AICR are out of scope for AICR; only an upstream chart change can cover them.

(localformat is the shared folder writer that all deployers compose, not a user-facing deployer itself.)

PR #965's annotation closes the GitOps-deployer gap by making the pod-template change visible in every deployer's rendered artifact — they all re-roll the DaemonSet on next reconcile. The durability question is how to keep the annotation correct across future chart bumps.

Why PR #965's mitigation isn't durable

# recipes/components/nvidia-dra-driver-gpu/values.yaml
kubeletPlugin:
  podAnnotations:
    aicr.nvidia.com/gpu-operator-chart-version: v26.3.1   # ← must be bumped manually

  • The annotation value is a free-form string. Helm has no awareness it's supposed to mirror gpu-operator's chart version.
  • A future PR that bumps gpu-operator (e.g. v26.4.0) but forgets the annotation produces identical rendered DaemonSet manifests for nvidia-dra-driver-gpu. helm upgrade skips the roll. Stale NVML returns.
  • make qualify does not catch this — no static or runtime check verifies the two pins are coherent.
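
To make the "identical manifest, no roll" failure concrete, here is a minimal Go sketch. It is an illustration only: templateHash is a hypothetical stand-in that fingerprints just the pod annotations, not the hash Helm or the DaemonSet controller actually computes. The point it shows is that a DaemonSet only rolls when its pod template changes, so a forgotten annotation bump leaves nothing for helm upgrade to act on.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// Annotation key taken from the issue text.
const rolloutAnnotation = "aicr.nvidia.com/gpu-operator-chart-version"

// templateHash is a hypothetical fingerprint of the pod-template annotations.
// encoding/json sorts map keys, so equal maps always hash to the same value.
func templateHash(podAnnotations map[string]string) string {
	b, err := json.Marshal(podAnnotations)
	if err != nil {
		panic(err) // unreachable for map[string]string; kept for completeness
	}
	return fmt.Sprintf("%x", sha256.Sum256(b))
}

func main() {
	deployed := map[string]string{rolloutAnnotation: "v26.3.1"}

	// gpu-operator was bumped to v26.4.0 but the annotation was forgotten:
	// the DRA pod template is unchanged.
	forgotten := map[string]string{rolloutAnnotation: "v26.3.1"}

	// gpu-operator was bumped and the annotation was kept in lockstep.
	bumped := map[string]string{rolloutAnnotation: "v26.4.0"}

	fmt.Println("forgotten bump changes template:", templateHash(deployed) != templateHash(forgotten))
	fmt.Println("lockstep bump changes template: ", templateHash(deployed) != templateHash(bumped))
}
```
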

Recommended durable fix: bundler-derived annotation

Inject the annotation at bundle-generation time from the resolved recipe, not at hand-edit time.

Concrete shape:

  • Source: read the gpu-operator ComponentRef.Version from the resolved recipe (the same value the resolver writes into the bundle for the gpu-operator chart).
  • Destination: inject into both
    • controller.podAnnotations.aicr.nvidia.com/gpu-operator-chart-version
    • kubeletPlugin.podAnnotations.aicr.nvidia.com/gpu-operator-chart-version
      of the nvidia-dra-driver-gpu rendered values.
  • Trigger condition: only inject when both gpu-operator and nvidia-dra-driver-gpu are enabled in the filtered resolved recipe. DefaultBundler.Make (pkg/bundler/bundler.go:213-254) builds an enabledRefs slice that excludes disabled components before any value extraction; the helper should consume that filtered recipeResult.ComponentRefs, not the unfiltered recipe.
  • Injection point: in DefaultBundler.Make, after extractComponentValues (line 277) has loaded recipe values, applied user --set overrides, and applied scheduling overrides — but before buildDeployer (line 311). This window guarantees:
    • every deployer (Helm, helmfile, Flux, Argo CD, argocd-helm) receives the same final componentValues map;
    • a user override such as --set gpuoperator:rollout-trigger=foo cannot accidentally defeat the internal rollout annotation, because injection runs after overrides have been applied (unless AICR intentionally decides to allow that, which would be a separate explicit knob).
  • Maintainer-facing values file in recipes/components/nvidia-dra-driver-gpu/values.yaml then carries no chart-version string at all — that data lives only in the resolved recipe.
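
A rough sketch of what the helper could look like, to make the shape above easier to review. Everything here is an assumption for illustration: the ComponentRef struct is a cut-down stand-in for the real type, the map[string]map[string]any layout for componentValues is a guess, and injectRolloutAnnotation is a hypothetical name. Only the component names, the annotation key, and the two podAnnotations destinations come from this issue.

```go
package main

import "fmt"

// ComponentRef is a hypothetical, reduced stand-in for the resolved-recipe
// entry; the real type in AICR likely has more fields.
type ComponentRef struct {
	Name    string
	Version string
}

// Annotation key taken from the issue text.
const rolloutAnnotation = "aicr.nvidia.com/gpu-operator-chart-version"

// injectRolloutAnnotation writes the resolved gpu-operator version into the
// controller and kubeletPlugin podAnnotations of the DRA driver values.
// It expects the already-filtered (enabled-only) refs and reports whether
// anything was injected. Assumed layout: component name -> values tree.
func injectRolloutAnnotation(enabledRefs []ComponentRef, componentValues map[string]map[string]any) bool {
	var gpuVersion string
	var haveDRA bool
	for _, ref := range enabledRefs {
		switch ref.Name {
		case "gpu-operator":
			gpuVersion = ref.Version
		case "nvidia-dra-driver-gpu":
			haveDRA = true
		}
	}
	// Gate: both components must be enabled, and the version must be known.
	if gpuVersion == "" || !haveDRA {
		return false
	}

	values := componentValues["nvidia-dra-driver-gpu"]
	if values == nil {
		values = map[string]any{}
		componentValues["nvidia-dra-driver-gpu"] = values
	}
	for _, section := range []string{"controller", "kubeletPlugin"} {
		// A missing or non-map node is replaced with a fresh map.
		sec, ok := values[section].(map[string]any)
		if !ok {
			sec = map[string]any{}
			values[section] = sec
		}
		ann, ok := sec["podAnnotations"].(map[string]any)
		if !ok {
			ann = map[string]any{}
			sec["podAnnotations"] = ann
		}
		// Written last, so earlier user overrides cannot clobber it.
		ann[rolloutAnnotation] = gpuVersion
	}
	return true
}

func main() {
	refs := []ComponentRef{
		{Name: "gpu-operator", Version: "v26.3.1"},
		{Name: "nvidia-dra-driver-gpu", Version: "v25.12.0"},
	}
	values := map[string]map[string]any{}
	fmt.Println("injected:", injectRolloutAnnotation(refs, values))
	fmt.Println(values["nvidia-dra-driver-gpu"])
}
```

In this sketch the call would sit between extractComponentValues and buildDeployer, so the mutated map is what every deployer receives. Returning a bool rather than logging keeps the disabled-component case silent.
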

Why this is the right shape:

  • Works for every deployer. All deployers consume componentValues from the same code path, so all of them get the pod-template change.
  • Single source of truth. The annotation is derived from the resolved recipe, which is exactly the source that #966 (chore(recipes): make resolved recipes the single source of truth for chart versions) is converging on, so there's no separate string to maintain.
  • Bisects cleanly. A maintainer who only edits gpu-operator's pin gets the DRA annotation update for free in the bundle output.

Near-term guardrail (safe to ship while the bundler-derived fix is in flight)

A static coherence check, ideally wired into make qualify or a new make check-dra-rollout-trigger:

  • Compare the resolved recipe's gpu-operator ComponentRef.Version against the annotation value in recipes/components/nvidia-dra-driver-gpu/values.yaml. Fail if they differ.
  • Compare against the resolved recipe, not against recipes/components/gpu-operator/values.yaml or helm.defaultVersion in the registry. This aligns with #966 (chore(recipes): make resolved recipes the single source of truth for chart versions) and avoids creating a new drift point.

This is the cheapest guardrail and is worth shipping immediately if the bundler-derived fix isn't ready for the same release.

Definition of done

  • Primary: DefaultBundler.Make injects aicr.nvidia.com/gpu-operator-chart-version (or equivalent annotation) into the rendered nvidia-dra-driver-gpu componentValues for both controller and kubeletPlugin, sourced from the resolved gpu-operator ComponentRef.Version, gated on both refs being enabled in the filtered recipe, injected after extractComponentValues and before buildDeployer. Maintainer-facing values file no longer carries the chart-version string.
  • Acceptance test (generated-artifact parity): bump the gpu-operator version in a fixture; assert the rendered DRA podAnnotations change in at least the default Helm-deployer output AND one GitOps-deployer output (helmfile or flux or argocd). Catches the cross-deployer parity risk.
  • Acceptance test (disabled-component negative case): with nvidia-dra-driver-gpu disabled in the recipe, assert the helper does not inject the annotation anywhere AND does not emit a warning. Proves the gating is correct.
  • Near-term guardrail (optional, ship if bundler change slips): make target that fails when the resolved gpu-operator chart version diverges from the DRA annotation. Compares against resolved recipe (not registry default).
  • Cluster verification: on a cluster running an older gpu-operator, aicr bundle + deploy via the chosen deployer rolls the DRA pods automatically, and aicr validate --phase performance passes without manual intervention.

Out of scope

  • Bumping nvidia-dra-driver-gpu itself (still v25.12.0; not driver-version-bound in its registry pin).
  • Coverage for users running the upstream k8s-dra-driver-gpu Helm chart directly without AICR artifacts — only an upstream chart change can cover that population. Worth a separate upstream conversation, not AICR-tracked.


Credit: v2 reframing (issue is a durability question, not the original bug) from Codex review of v1. v3 refinements (deploy.sh hook already present, bundler-derived annotation as the right primary, resolved-recipe basis for the coherence check, generated-artifact tests, upstream-chart-direct caveat) from Codex review of v2. v4 refinements (filtered-recipe gating, exact injection point between extractComponentValues and buildDeployer, "default Helm deployer" terminology, disabled-component negative test) from Codex review of v3.
