From 7c920d07edde9362d7b3fe3faa09de5181430695 Mon Sep 17 00:00:00 2001
From: Yuan Chen <yuanchen97@gmail.com>
Date: Tue, 19 May 2026 07:48:27 -0700
Subject: [PATCH] chore(recipes): bump gpu-operator chart to v26.3.1 and driver
 to 580.126.20 (#894)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Lift every leaf recipe to the current upstream stable gpu-operator
chart (v26.3.1, published 2026-04-18). The registry already defaulted
to v26.3.1, but base.yaml shadowed it with v25.10.1 and the aks/oke
overlays held v26.3.0 — so leaves actually shipped v25.10.1 / v26.3.0
images even though the BOM advertised v26.3.1.

- recipes/overlays/base.yaml:    v25.10.1 -> v26.3.1
- recipes/overlays/aks.yaml:     v26.3.0  -> v26.3.1
- recipes/overlays/oke.yaml:     v26.3.0  -> v26.3.1

The global driver pin moves from 580.105.08 to 580.126.20, NVIDIA's
v26.3.1-qualified version. This also matches the GB200+EFA floor, so the
per-overlay driver.version: 580.126.20 in gb200-eks-{training,
inference}.yaml becomes redundant and is dropped; kernelModuleConfig
stays.

Upstream v26.3.x flipped ccManager to enabled=true,defaultMode=on by
default. Pin ccManager.enabled: false in the global values.yaml so we
do not silently turn on Confidential Compute Manager on clusters
without CC-capable hardware. Revisit when AICR adds explicit CC
support.

Flip driver.rdma.enabled global default true -> false. v26.3.1's
driver-validation init container now hard-gates on lsmod | grep
nvidia_peermem; the module only loads against Mellanox MOFED
symbols, so AWS EFA (EKS p4d/p5/p5e) and Linode hosts trap the
validator forever even though NCCL on EFA uses aws-ofi-nccl /
libfabric and does not need nvidia_peermem. Re-enable rdma
explicitly in components/gpu-operator/values-aks.yaml and
values-aks-training.yaml — AKS overlays deploy network-operator
which installs MOFED on ND-series InfiniBand nodes, so peermem
has Mellanox symbols to bind against. Set driver.rdma.useHostMofed:
true alongside, so the driver container binds nvidia_peermem
against the network-operator-installed host MOFED instead of
building its own bundled MOFED inside the container. GKE-cos /
OKE / kind keep driver.enabled: false (host-managed driver) so
the flag is moot; EKS / LKE inherit the new safe default.
cuj1-training chainsaw assertion updated to match.
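
For triage, the peermem gate can be approximated by hand on a node.
This is a sketch, not the validator's actual code: has_peermem is a
hypothetical helper that reads lsmod output on stdin and succeeds only
when nvidia_peermem is listed.

```shell
# Hypothetical helper (not part of this patch): succeed if the lsmod
# output on stdin lists the nvidia_peermem module. This only approximates
# the `lsmod | grep nvidia_peermem` gate described above.
has_peermem() {
  awk '$1 == "nvidia_peermem" { found = 1 } END { exit !found }'
}

# On a GPU node (assumes shell access to the host):
#   lsmod | has_peermem && echo "peermem loaded" || echo "peermem missing"
```

On EFA or Linode hosts this is expected to report the module as
missing, which is why the global default is now off.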

Add gpu-operator-chart-version pod annotation to the DRA driver
in recipes/components/nvidia-dra-driver-gpu/values.yaml
(kubeletPlugin and controller). gpu-operator's k8s-driver-manager
reloads the host NVIDIA kernel modules during a driver bump but
does NOT restart the sibling nvidia-dra-driver-gpu DaemonSet
because its chart template is unchanged. The DRA kubelet plugin
loads libnvidia-ml.so at pod start and pins to the running driver
version, so a kernel-module reload leaves the pod with a stale
NVML handle; CDI spec generation then fails with Driver/library
version mismatch and DRA-allocated workloads stay in
ContainerCreating. Bumping the annotation value on every
gpu-operator chart bump forces a DaemonSet re-roll across every
deployer (Helm, helmfile, Flux, Argo CD), refreshing NVML against
the now-running driver. The deploy.sh Helm-deployer template
already restarts the kubelet plugin post-install; the annotation
closes the gap for GitOps-deployer artifacts. The annotation value
must still be bumped by hand with each chart bump; that maintenance
gap is tracked in issue #973, with a bundler-derived annotation as
the durable fix. Verified live on aicr2: applying the new bundle
rolled both DRA pods cleanly, and aicr validate --phase
performance passed end-to-end (inference-perf 39.7k tok/s,
TTFT p99 122ms).
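
A quick way to confirm the re-roll landed is to compare each DRA pod's
annotation against the expected chart version. The sketch below is
illustrative only: stale_dra_pods and the DRA_NS variable are
assumptions, not part of this patch.

```shell
# Hypothetical helper (not part of this patch): read "<pod> <annotation>"
# pairs on stdin and print the pods whose annotation differs from $1.
stale_dra_pods() {
  awk -v want="$1" '$2 != want { print $1 }'
}

# Against a live cluster (DRA_NS is an assumed variable holding the
# namespace where nvidia-dra-driver-gpu is installed):
#   kubectl get pods -n "${DRA_NS}" --no-headers \
#     -o custom-columns='POD:.metadata.name,VER:.metadata.annotations.aicr\.nvidia\.com/gpu-operator-chart-version' \
#     | stale_dra_pods v26.3.1
```

Empty output means every pod in the namespace carries the expected
annotation value; pods without the annotation show up as <none> and
are listed.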

Gate the existing post-install DRA kubelet-plugin restart on
gpu-operator's per-node driver migration completing. The
annotation-based re-roll above and the existing kubectl rollout
restart in deploy.sh both fire at helm-upgrade time, but
k8s-driver-manager runs the per-node module reload asynchronously
after `helm upgrade gpu-operator` returns. On a multi-GPU-node
cluster, the DRA plugin can re-roll on a node whose driver
migration has not yet started, pin its NVML handle to the
pre-migration driver state, and fail with "invalid CDI Spec: empty
device edits" once the modules reload underneath it. Reproduced live on
a GB200 EKS cluster (yljtrxpmzu) during PR validation: the
chart-version annotation re-rolled the DRA pods correctly, but
the second GB200 node's driver migration ran after that, leaving
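its kubelet-plugin pod with a stale NVML view. deploy.sh now runs a
kubectl wait for nvidia.com/gpu-driver-upgrade-state=upgrade-done on
every operator-managed GPU node (nvidia.com/gpu.deploy.driver=true)
before the existing post-install rollout-restart, which closes this
timing race for the default Helm-deployer flow. The wait is skipped
when the driver DaemonSet is absent, and the restart proceeds with a
warning if the wait times out after 15 minutes. See #973 for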
broader cross-deployer coverage (a Helm post-install hook on
gpu-operator-post would cover helmfile/Flux/Argo CD too).
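
When the wait times out, it helps to see which nodes are still
migrating. A minimal sketch, assuming kubectl access; pending_nodes is
a hypothetical helper and not part of deploy.sh:

```shell
# Hypothetical helper (not part of this patch): read "<node> <state>"
# pairs on stdin and print the nodes that are not yet upgrade-done.
pending_nodes() {
  awk '$2 != "upgrade-done" { print $1 }'
}

# Against a live cluster, restricted to the nodes the wait covers:
#   kubectl get nodes -l nvidia.com/gpu.deploy.driver=true --no-headers \
#     -o custom-columns='NODE:.metadata.name,STATE:.metadata.labels.nvidia\.com/gpu-driver-upgrade-state' \
#     | pending_nodes
```

Nodes that lack the label print as <none> and are reported as pending,
which is the conservative reading.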

BOM (docs/user/container-images.md) regenerated; only the driver
image moves, since the registry-driven BOM was already on v26.3.1.
The BOM-vs-leaf drift class itself is tracked separately in #966.
---
 docs/user/container-images.md                 |  2 +-
 .../deployer/helm/templates/deploy.sh.tmpl    | 40 +++++++++++++++++++
 .../gpu-operator/values-aks-training.yaml     |  9 +++++
 .../components/gpu-operator/values-aks.yaml   | 13 ++++++
 recipes/components/gpu-operator/values.yaml   | 16 +++++++-
 .../nvidia-dra-driver-gpu/values.yaml         | 18 +++++++++
 recipes/overlays/aks.yaml                     |  2 +-
 recipes/overlays/base.yaml                    |  2 +-
 recipes/overlays/gb200-eks-inference.yaml     |  1 -
 recipes/overlays/gb200-eks-training.yaml      |  4 --
 recipes/overlays/oke.yaml                     |  2 +-
 .../assert-bundle-scheduling.yaml             |  8 +++-
 12 files changed, 104 insertions(+), 13 deletions(-)

diff --git a/docs/user/container-images.md b/docs/user/container-images.md
index ab2dbc9a4..47b7ed57d 100644
--- a/docs/user/container-images.md
+++ b/docs/user/container-images.md
@@ -107,7 +107,7 @@ _No images extracted._
 - `nvcr.io/nvidia/cloud-native/nvidia-fs:2.27.3`
 - `nvcr.io/nvidia/cloud-native/nvidia-sandbox-device-plugin:v0.0.3`
 - `nvcr.io/nvidia/cloud-native/vgpu-device-manager:v0.4.2`
-- `nvcr.io/nvidia/driver:580.105.08`
+- `nvcr.io/nvidia/driver:580.126.20`
 - `nvcr.io/nvidia/gpu-operator:v26.3.1`
 - `nvcr.io/nvidia/k8s-device-plugin:v0.19.0`
 - `nvcr.io/nvidia/k8s/container-toolkit:v1.19.0`
diff --git a/pkg/bundler/deployer/helm/templates/deploy.sh.tmpl b/pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
index 0b85ea11f..be6b03f3d 100644
--- a/pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
+++ b/pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
@@ -346,6 +346,46 @@ for dir in "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/; do
   {{- range .Components }}
   {{- if eq .Name "nvidia-dra-driver-gpu" }}
   if [[ "${name}" == "nvidia-dra-driver-gpu" ]]; then
+    # gpu-operator's k8s-driver-manager reloads NVIDIA kernel modules
+    # asynchronously per-node after `helm upgrade gpu-operator` returns.
+    # If the DRA kubelet plugin pod re-rolls (via the chart's
+    # podAnnotations or the post-install restart below) before those
+    # reloads finish on every managed GPU node, the freshly-started
+    # plugin can pin its NVML handle to the now-stale driver state on
+    # a not-yet-migrated node. CDI spec generation then fails with
+    # "invalid CDI Spec: empty device edits" and DRA-allocated pods
+    # stay in ContainerCreating until the plugin is restarted again.
+    # Wait for the migration to settle on every managed node before
+    # touching the plugin. See issue #973.
+    #
+    # Two gates skip this wait safely:
+    #   1. The gpu-operator nvidia-driver-daemonset is absent — host-
+    #      managed-driver recipes (GKE COS, OKE, Kind, etc.) run
+    #      gpu-operator with driver.enabled=false, so the DaemonSet is
+    #      never created. Look it up by name across all namespaces so
+    #      we discover it whichever namespace the operator runs in
+    #      (e.g. os-talos moves it to "privileged-gpu-operator").
+    #   2. No nodes carry nvidia.com/gpu.deploy.driver=true — the
+    #      operator's reconciler sets that label only on nodes its
+    #      driver DaemonSet selects (this respects
+    #      --accelerated-node-selector). Waiting on every
+    #      gpu.present=true node would block until the 15-min timeout
+    #      for any GPU node the operator deliberately excludes.
+    DRIVER_DS_NS=$(kubectl get daemonset -A -o jsonpath='{.items[?(@.metadata.name=="nvidia-driver-daemonset")].metadata.namespace}' 2>/dev/null | awk '{print $1}')
+    if [[ -z "${DRIVER_DS_NS}" ]]; then
+      echo "  gpu-operator nvidia-driver-daemonset not present (host-managed driver); skipping migration wait"
+    else
+      MANAGED_NODES=$(kubectl get nodes -l nvidia.com/gpu.deploy.driver=true -o name 2>/dev/null | wc -l | tr -d ' ')
+      if [[ "${MANAGED_NODES}" -gt 0 ]]; then
+        echo "  Waiting for gpu-operator driver migration on ${MANAGED_NODES} managed GPU node(s) to reach upgrade-done (ns=${DRIVER_DS_NS})..."
+        if ! kubectl wait --for=jsonpath='{.metadata.labels.nvidia\.com/gpu-driver-upgrade-state}=upgrade-done' \
+             nodes -l nvidia.com/gpu.deploy.driver=true --timeout=15m; then
+          echo "  WARNING: not all managed GPU nodes reached upgrade-done within 15m; proceeding with restart anyway"
+        fi
+      else
+        echo "  No nodes labeled nvidia.com/gpu.deploy.driver=true yet; skipping migration wait"
+      fi
+    fi
     # Best-effort mitigation for kubelet DRA plugin registration drift.
     # After uninstall/reinstall, kubelet's fsnotify watcher may not detect new
     # registration sockets. Restarting the plugin DS forces fresh socket creation.
diff --git a/recipes/components/gpu-operator/values-aks-training.yaml b/recipes/components/gpu-operator/values-aks-training.yaml
index 585e77041..21197f9f8 100644
--- a/recipes/components/gpu-operator/values-aks-training.yaml
+++ b/recipes/components/gpu-operator/values-aks-training.yaml
@@ -23,6 +23,15 @@ cdi:
 toolkit:
   enabled: false
 
+# Re-enable GPUDirect RDMA. AKS ships MOFED via network-operator on
+# ND-series InfiniBand nodes; useHostMofed: true binds nvidia_peermem
+# against the host's MOFED kernel modules (see values-aks.yaml for the
+# full rationale).
+driver:
+  rdma:
+    enabled: true
+    useHostMofed: true
+
 validator:
   plugin:
     env:
diff --git a/recipes/components/gpu-operator/values-aks.yaml b/recipes/components/gpu-operator/values-aks.yaml
index aaed6e1d9..a4fd1600c 100644
--- a/recipes/components/gpu-operator/values-aks.yaml
+++ b/recipes/components/gpu-operator/values-aks.yaml
@@ -28,6 +28,19 @@
 nfd:
   enabled: false
 
+# Re-enable GPUDirect RDMA. The AKS overlay deploys network-operator,
+# which installs MOFED kernel modules on the host (ND-series InfiniBand
+# nodes). useHostMofed: true tells gpu-operator's driver container to
+# bind nvidia_peermem against the host's MOFED symbols instead of
+# building its own bundled MOFED — required for the network-operator
+# integration to actually work end-to-end. The global default in
+# components/gpu-operator/values.yaml is off because EFA / non-MOFED
+# fabrics trip v26.3.1's driver-validation gate.
+driver:
+  rdma:
+    enabled: true
+    useHostMofed: true
+
 # The following flags are set in the aks-rdma-infiniband reference configuration
 # but are not required for RDMA functionality. They suppress DaemonSets that
 # serve no purpose on AKS ND-H100 nodes. Uncomment if your deployment needs them.
diff --git a/recipes/components/gpu-operator/values.yaml b/recipes/components/gpu-operator/values.yaml
index 625965e69..67f3a80f7 100644
--- a/recipes/components/gpu-operator/values.yaml
+++ b/recipes/components/gpu-operator/values.yaml
@@ -137,12 +137,19 @@ gfd:
   enabled: true
 
 driver:
-  version: 580.105.08
+  # NVIDIA's recommended driver for the v26.3.1 chart; matches the
+  # GB200+EFA floor so a single global pin covers H100/B200/GB200 EKS.
+  version: 580.126.20
   enabled: true
   useOpenKernelModules: true
   maxParallelUpgrades: 5
   rdma:
-    enabled: true
+    # Default off: nvidia_peermem only loads against Mellanox MOFED
+    # symbols. AWS EFA (EKS p4d/p5/p5e) and Linode have no MOFED, so
+    # peermem fails to load and v26.3.1's stricter driver-validation
+    # init container blocks the rest of the GPU stack. Overlays that
+    # ship MOFED (AKS via network-operator) explicitly re-enable this.
+    enabled: false
 
 devicePlugin:
   env:
@@ -166,3 +173,8 @@ validator:
 # NFD deployed as standalone shared component — disable sub-chart
 nfd:
   enabled: false
+
+# Confidential Compute Manager defaults to enabled in chart v26.3.x; keep
+# it off until AICR has explicit CC-capable hardware support.
+ccManager:
+  enabled: false
diff --git a/recipes/components/nvidia-dra-driver-gpu/values.yaml b/recipes/components/nvidia-dra-driver-gpu/values.yaml
index 7a2c17170..117b11962 100644
--- a/recipes/components/nvidia-dra-driver-gpu/values.yaml
+++ b/recipes/components/nvidia-dra-driver-gpu/values.yaml
@@ -56,7 +56,25 @@ resources:
   gpus:
     enabled: true
 
+# gpu-operator-chart-version annotation forces a DaemonSet re-roll when
+# the gpu-operator chart (and its managed driver) bumps. The DRA
+# kubelet plugin loads libnvidia-ml.so at pod start and pins to the
+# driver version running at that moment; gpu-operator's k8s-driver-manager
+# reloads the host kernel modules during a driver bump but does NOT
+# restart the sibling DRA DaemonSet (its chart template hasn't changed),
+# leaving the kubelet plugin's NVML handle stale. CDI spec generation
+# then fails with "Driver/library version mismatch" and DRA-allocated
+# pods stay in ContainerCreating.
+#
+# Bumping this annotation value on every gpu-operator chart bump (here:
+# v26.3.1) changes the rendered pod template and forces helm upgrade to
+# roll the DaemonSet, picking up a fresh NVML handle against the
+# now-running driver. Track follow-up to automate this in #973.
 controller:
   priorityClassName: ""
+  podAnnotations:
+    aicr.nvidia.com/gpu-operator-chart-version: v26.3.1
 kubeletPlugin:
   priorityClassName: ""
+  podAnnotations:
+    aicr.nvidia.com/gpu-operator-chart-version: v26.3.1
diff --git a/recipes/overlays/aks.yaml b/recipes/overlays/aks.yaml
index 0a113404a..b8dfee375 100644
--- a/recipes/overlays/aks.yaml
+++ b/recipes/overlays/aks.yaml
@@ -42,7 +42,7 @@ spec:
     # AKS pre-installs NVIDIA container toolkit; disable toolkit installation
     - name: gpu-operator
       type: Helm
-      version: "v26.3.0"
+      version: "v26.3.1"
       valuesFile: components/gpu-operator/values-aks.yaml
       dependencyRefs:
         - network-operator
diff --git a/recipes/overlays/base.yaml b/recipes/overlays/base.yaml
index e16d0bf10..88a60ba1c 100644
--- a/recipes/overlays/base.yaml
+++ b/recipes/overlays/base.yaml
@@ -40,7 +40,7 @@ spec:
     - name: gpu-operator
       type: Helm
       source: https://helm.ngc.nvidia.com/nvidia
-      version: v25.10.1
+      version: v26.3.1
       valuesFile: components/gpu-operator/values.yaml
       dependencyRefs:
         - nfd
diff --git a/recipes/overlays/gb200-eks-inference.yaml b/recipes/overlays/gb200-eks-inference.yaml
index 2027beed2..7a9e0a2c9 100644
--- a/recipes/overlays/gb200-eks-inference.yaml
+++ b/recipes/overlays/gb200-eks-inference.yaml
@@ -56,7 +56,6 @@ spec:
         gdrcopy:
           enabled: true
         driver:
-          version: 580.126.20
           kernelModuleConfig:
             name: nvidia-kernel-module-params
 
diff --git a/recipes/overlays/gb200-eks-training.yaml b/recipes/overlays/gb200-eks-training.yaml
index a11d903d9..b084b3e9d 100644
--- a/recipes/overlays/gb200-eks-training.yaml
+++ b/recipes/overlays/gb200-eks-training.yaml
@@ -61,10 +61,6 @@ spec:
         gdrcopy:
           enabled: true
         driver:
-          # 580.126.20 is NVIDIA's recommended floor for GB200+EFA; the global
-          # default (580.105.08 in components/gpu-operator/values.yaml) stays
-          # unchanged for H100/B200 and non-EKS GB200 recipes.
-          version: 580.126.20
           kernelModuleConfig:
             name: nvidia-kernel-module-params
 
diff --git a/recipes/overlays/oke.yaml b/recipes/overlays/oke.yaml
index f87df7a62..7f381a25c 100644
--- a/recipes/overlays/oke.yaml
+++ b/recipes/overlays/oke.yaml
@@ -40,7 +40,7 @@ spec:
     # (BM.GPU.B200, BM.GPU.H100, etc.). Disable both to avoid conflicts.
     - name: gpu-operator
       type: Helm
-      version: v26.3.0
+      version: v26.3.1
       valuesFile: components/gpu-operator/values-oke.yaml
 
     # Prometheus persistent storage (provide --storage-class at bundle time, e.g. oci-bv)
diff --git a/tests/chainsaw/cli/cuj1-training/assert-bundle-scheduling.yaml b/tests/chainsaw/cli/cuj1-training/assert-bundle-scheduling.yaml
index 6e4477f30..de1772efe 100644
--- a/tests/chainsaw/cli/cuj1-training/assert-bundle-scheduling.yaml
+++ b/tests/chainsaw/cli/cuj1-training/assert-bundle-scheduling.yaml
@@ -37,11 +37,15 @@ daemonsets:
       value: present
       effect: NoSchedule
 
-# ── Driver: RDMA required for multi-node training ────────────────────
+# ── Driver: nvidia_peermem off on EKS (AWS EFA path uses aws-ofi-nccl)
+# nvidia_peermem only loads against Mellanox MOFED symbols; on AWS EFA
+# (p4d/p5/p5e) it fails to load and v26.3.1's strict driver-validation
+# init container blocks the rest of the GPU stack. NCCL multi-node on
+# EFA uses libfabric via aws-ofi-nccl, not nvidia_peermem.
 driver:
   enabled: true
   rdma:
-    enabled: true
+    enabled: false
   useOpenKernelModules: true
 
 # ── GDRCopy: GPU-direct memory for high-performance training ─────────