Summary
PR #965 mitigates the stale-NVML class of bug by hard-coding the gpu-operator chart version into a DRA pod-template annotation. The annotation works only as long as it is kept in lockstep with the gpu-operator chart pin, which depends on a maintainer remembering to bump both in the same PR.
That's brittle. A future gpu-operator bump that forgets the annotation will silently re-introduce the original failure (stale NVML, DRA FailedPrepareDynamicResources, GPU workloads stuck in ContainerCreating). This issue tracks the durable fix.
Pre-PR-#965 repro (context)
After upgrading gpu-operator from v25.10.1 (driver 580.105.08) → v26.3.1 (driver 580.126.20), the nvidia-dra-driver-gpu kubelet-plugin DaemonSet pod is not restarted by helm upgrade. Its in-pod NVML handle remains bound to the old driver, and DRA allocation fails with:
FailedPrepareDynamicResources: Failed to prepare dynamic resources:
NodePrepareResources failed for ResourceClaim ...:
unable to create CDI spec file for claim: ...:
failed to initialize NVML: Driver/library version mismatch
Workloads that request GPUs via DRA (Dynamo DynamoGraphDeployment, aicr validate --phase performance inference-perf, etc.) stay in ContainerCreating indefinitely. Manual workaround pre-#965: kubectl delete pod -n nvidia-dra-driver -l app.kubernetes.io/name=nvidia-dra-driver-gpu.
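What's already in place (don't reinvent)
deploy.sh.tmpl (pkg/bundler/deployer/helm/templates/deploy.sh.tmpl:354) already emits a post-install kubectl rollout restart for the DRA kubelet-plugin DaemonSet, gated on name == "nvidia-dra-driver-gpu". So:
✅ Users running the bundled ./deploy.sh (default Helm deployer) get the restart automatically.
❌ Users consuming helmfile, flux, argocd, or argocd-helm artifacts get no restart: there's no equivalent in those deployer templates (grep -rn "kubelet-plugin" pkg/bundler/deployer/{helmfile,flux,argocd*}/ returns zero matches).
❌ Users running helm upgrade directly on the upstream chart without going through AICR are out of scope for AICR; only an upstream chart change can cover them.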
(localformat is the shared folder writer that all deployers compose, not a user-facing deployer itself.)
PR #965's annotation closes the GitOps-deployer gap by making the pod-template change visible in every deployer's rendered artifact — they all re-roll the DaemonSet on next reconcile. The durability question is how to keep the annotation correct across future chart bumps.
Why PR #965's mitigation isn't durable
# recipes/components/nvidia-dra-driver-gpu/values.yaml
kubeletPlugin:
  podAnnotations:
    aicr.nvidia.com/gpu-operator-chart-version: v26.3.1 # ← must be bumped manually
The annotation value is a free-form string. Helm has no awareness it's supposed to mirror gpu-operator's chart version.
A future PR that bumps gpu-operator (e.g. v26.4.0) but forgets the annotation produces identical rendered DaemonSet manifests for nvidia-dra-driver-gpu. helm upgrade skips the roll. Stale NVML returns.
make qualify does not catch this — no static or runtime check verifies the two pins are coherent.
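To make the failure mode concrete, here is a minimal Python sketch. It assumes, as a simplification, that a DaemonSet re-rolls only when its rendered pod template changes, and it models that with a hash over the template. The function names and the dict layout are hypothetical illustrations, not AICR, Helm, or Kubernetes code.

```python
import hashlib
import json


def render_dra_pod_template(annotation_value: str) -> dict:
    """Hypothetical stand-in for the rendered kubelet-plugin pod template."""
    return {
        "metadata": {
            "annotations": {
                "aicr.nvidia.com/gpu-operator-chart-version": annotation_value,
            }
        },
        "spec": {"containers": [{"name": "kubelet-plugin"}]},
    }


def template_fingerprint(template: dict) -> str:
    """Stable hash of the template; a changed fingerprint models a re-roll."""
    canonical = json.dumps(template, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# gpu-operator moves from v26.3.1 to v26.4.0 in both scenarios below.
before = template_fingerprint(render_dra_pod_template("v26.3.1"))
# Scenario 1: the hand-edited annotation is forgotten and stays at v26.3.1.
forgotten = template_fingerprint(render_dra_pod_template("v26.3.1"))
# Scenario 2: the annotation is bumped together with the chart pin.
bumped = template_fingerprint(render_dra_pod_template("v26.4.0"))

rolls_when_forgotten = before != forgotten  # template unchanged, so no roll
rolls_when_bumped = before != bumped        # template changed, so pods roll
```

Nothing in the DRA chart's own inputs changes when only the gpu-operator pin moves, which is why the forgotten-annotation case is invisible to helm upgrade.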
Recommended durable fix: bundler-derived annotation
Inject the annotation at bundle-generation time from the resolved recipe, not at hand-edit time.
Concrete shape:
Source: read the gpu-operator ComponentRef.Version from the resolved recipe (the same value the resolver writes into the bundle for the gpu-operator chart).
Target: set both
controller.podAnnotations.aicr.nvidia.com/gpu-operator-chart-version
kubeletPlugin.podAnnotations.aicr.nvidia.com/gpu-operator-chart-version
of the nvidia-dra-driver-gpu rendered values.
Trigger condition: only inject when both gpu-operator and nvidia-dra-driver-gpu are enabled in the filtered resolved recipe. DefaultBundler.Make (pkg/bundler/bundler.go:213-254) builds an enabledRefs slice that excludes disabled components before any value extraction; the helper should consume that filtered recipeResult.ComponentRefs, not the unfiltered recipe.
Injection point: in DefaultBundler.Make, after extractComponentValues (line 277) has loaded recipe values, applied user --set overrides, and applied scheduling overrides, but before buildDeployer (line 311). This window guarantees:
every deployer (Helm, helmfile, Flux, Argo CD, argocd-helm) receives the same final componentValues map;
a user --set gpuoperator:rollout-trigger=foo does not accidentally defeat the internal rollout annotation (unless AICR intentionally decides to allow that, which would be a separate explicit knob).
The maintainer-facing values file in recipes/components/nvidia-dra-driver-gpu/values.yaml then carries no chart-version string at all; that data lives only in the resolved recipe.
Why this is the right shape:
Works for every deployer. All deployers consume componentValues from the same code path, so all of them get the pod-template change.
Bisects cleanly. A maintainer who only edits gpu-operator's pin gets the DRA annotation update for free in the bundle output.
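The concrete shape above can be sketched as follows. This is a language-neutral illustration written in Python, not the actual Go helper: inject_rollout_annotation, the enabled_refs mapping (component name to resolved chart version), and the component_values layout are all assumptions made for the example. A real implementation would live in DefaultBundler.Make and use the bundler's own types.

```python
import copy

ANNOTATION_KEY = "aicr.nvidia.com/gpu-operator-chart-version"
GPU_OPERATOR = "gpu-operator"
DRA_DRIVER = "nvidia-dra-driver-gpu"


def inject_rollout_annotation(enabled_refs: dict, component_values: dict) -> dict:
    """Return a copy of component_values with the DRA rollout annotation set.

    enabled_refs: hypothetical map of component name -> resolved chart version,
        already filtered down to enabled components.
    component_values: hypothetical map of component name -> final Helm values,
        i.e. after --set and scheduling overrides have been applied.
    """
    result = copy.deepcopy(component_values)

    # Gate: inject only when both components survive the enabled filter.
    if GPU_OPERATOR not in enabled_refs or DRA_DRIVER not in enabled_refs:
        return result

    version = enabled_refs[GPU_OPERATOR]
    dra_values = result.setdefault(DRA_DRIVER, {})
    for section in ("controller", "kubeletPlugin"):
        annotations = dra_values.setdefault(section, {}).setdefault("podAnnotations", {})
        # Runs after user overrides, so the derived value always wins.
        annotations[ANNOTATION_KEY] = version
    return result


# Both components enabled: the annotation tracks the gpu-operator version.
refs = {"gpu-operator": "v26.4.0", "nvidia-dra-driver-gpu": "v25.12.0"}
values = {"nvidia-dra-driver-gpu": {"kubeletPlugin": {"podAnnotations": {}}}}
injected = inject_rollout_annotation(refs, values)

# DRA driver disabled: the values pass through untouched.
disabled = inject_rollout_annotation({"gpu-operator": "v26.4.0"}, values)
```

Returning a copy rather than mutating the input keeps the helper easy to test against fixtures, which matters for the generated-artifact parity test.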
Near-term guardrail (safe to ship while the bundler-derived fix is in flight)
A static coherence check, ideally wired into make qualify or a new make check-dra-rollout-trigger:
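Compare the resolved recipe's gpu-operator ComponentRef.Version against the annotation value in recipes/components/nvidia-dra-driver-gpu/values.yaml. Fail if they differ.
Do not compare against recipes/components/gpu-operator/values.yaml or helm.defaultVersion in the registry. This aligns with #966 (chore(recipes): make resolved recipes the single source of truth for chart versions) and avoids creating a new drift point.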
This is the cheapest guardrail and is worth shipping immediately if the bundler-derived fix isn't ready for the same release.
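A minimal sketch of the comparison such a check could perform, again in Python for illustration. It assumes the resolved gpu-operator version and the parsed DRA values file are already available as plain Python values; how they are loaded (YAML parsing, recipe resolution) is deliberately left out, and check_rollout_trigger and exit_code are hypothetical names.

```python
import sys

ANNOTATION_KEY = "aicr.nvidia.com/gpu-operator-chart-version"


def check_rollout_trigger(resolved_version: str, dra_values: dict) -> list:
    """Return a list of error strings; an empty list means the pins agree."""
    plugin = dra_values.get("kubeletPlugin") or {}
    annotations = plugin.get("podAnnotations") or {}
    actual = annotations.get(ANNOTATION_KEY)
    if actual != resolved_version:
        return [
            f"kubeletPlugin.podAnnotations[{ANNOTATION_KEY}] is {actual!r}, "
            f"but the resolved gpu-operator version is {resolved_version!r}"
        ]
    return []


def exit_code(resolved_version: str, dra_values: dict) -> int:
    """Print any errors to stderr and return a make-friendly exit code."""
    errors = check_rollout_trigger(resolved_version, dra_values)
    for error in errors:
        print(error, file=sys.stderr)
    return 1 if errors else 0


dra_values = {"kubeletPlugin": {"podAnnotations": {ANNOTATION_KEY: "v26.3.1"}}}
coherent = check_rollout_trigger("v26.3.1", dra_values)  # pins agree
drifted = check_rollout_trigger("v26.4.0", dra_values)   # annotation forgotten
```

A missing annotation is treated the same as a mismatched one, so deleting the key by accident also fails the check.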
Definition of done
Primary: DefaultBundler.Make injects aicr.nvidia.com/gpu-operator-chart-version (or equivalent annotation) into the rendered nvidia-dra-driver-gpu componentValues for both controller and kubeletPlugin, sourced from the resolved gpu-operator ComponentRef.Version, gated on both refs being enabled in the filtered recipe, injected after extractComponentValues and before buildDeployer. Maintainer-facing values file no longer carries the chart-version string.
Acceptance test (generated-artifact parity): bump the gpu-operator version in a fixture; assert the rendered DRA podAnnotations change in at least the default Helm-deployer output AND one GitOps-deployer output (helmfile or flux or argocd). Catches the cross-deployer parity risk.
Acceptance test (disabled-component negative case): with nvidia-dra-driver-gpu disabled in the recipe, assert the helper does not inject the annotation anywhere AND does not emit a warning. Proves the gating is correct.
Near-term guardrail (optional, ship if bundler change slips): make target that fails when the resolved gpu-operator chart version diverges from the DRA annotation. Compares against resolved recipe (not registry default).
Cluster verification: on a cluster running an older gpu-operator, aicr bundle + deploy via the chosen deployer rolls the DRA pods automatically, and aicr validate --phase performance passes without manual intervention.
Out of scope
Bumping nvidia-dra-driver-gpu itself (still v25.12.0; not driver-version-bound in its registry pin).
Coverage for users running the upstream k8s-dra-driver-gpu Helm chart directly without AICR artifacts — only an upstream chart change can cover that population. Worth a separate upstream conversation, not AICR-tracked.
Credit: v2 reframing (issue is a durability question, not the original bug) from Codex review of v1. v3 refinements (deploy.sh hook already present, bundler-derived annotation as the right primary, resolved-recipe basis for the coherence check, generated-artifact tests, upstream-chart-direct caveat) from Codex review of v2. v4 refinements (filtered-recipe gating, exact injection point between extractComponentValues and buildDeployer, "default Helm deployer" terminology, disabled-component negative test) from Codex review of v3.