From 7c920d07edde9362d7b3fe3faa09de5181430695 Mon Sep 17 00:00:00 2001
From: Yuan Chen
Date: Tue, 19 May 2026 07:48:27 -0700
Subject: [PATCH] chore(recipes): bump gpu-operator chart to v26.3.1 and driver to 580.126.20 (#894)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Lift every leaf recipe to the current upstream stable gpu-operator chart
(v26.3.1, published 2026-04-18). The registry already defaulted to
v26.3.1, but base.yaml shadowed it with v25.10.1 and the aks/oke
overlays held v26.3.0, so leaves actually shipped v25.10.1 / v26.3.0
images even though the BOM advertised v26.3.1.

- recipes/overlays/base.yaml: v25.10.1 -> v26.3.1
- recipes/overlays/aks.yaml: v26.3.0 -> v26.3.1
- recipes/overlays/oke.yaml: v26.3.0 -> v26.3.1

Driver pin moves to NVIDIA's v26.3.1-qualified version (580.126.20).
This also matches the GB200+EFA floor, so the per-overlay
driver.version: 580.126.20 in gb200-eks-{training, inference}.yaml
becomes redundant and is dropped; kernelModuleConfig stays.

Upstream v26.3.x flipped ccManager to enabled=true,defaultMode=on by
default. Pin ccManager.enabled: false in the global values.yaml so we do
not silently turn on Confidential Compute Manager on clusters without
CC-capable hardware. Revisit when AICR adds explicit CC support.

Flip driver.rdma.enabled global default true -> false. v26.3.1's
driver-validation init container now hard-gates on
lsmod | grep nvidia_peermem; the module only loads against Mellanox
MOFED symbols, so AWS EFA (EKS p4d/p5/p5e) and Linode hosts trap the
validator forever even though NCCL on EFA uses aws-ofi-nccl / libfabric
and does not need nvidia_peermem. Re-enable rdma explicitly in
components/gpu-operator/values-aks.yaml and values-aks-training.yaml:
AKS overlays deploy network-operator which installs MOFED on ND-series
InfiniBand nodes, so peermem has Mellanox symbols to bind against.
Set driver.rdma.useHostMofed: true alongside, so the driver container
binds nvidia_peermem against the network-operator-installed host MOFED
instead of building its own bundled MOFED inside the container.
GKE-cos / OKE / kind keep driver.enabled: false (host-managed driver) so
the flag is moot; EKS / LKE inherit the new safe default. cuj1-training
chainsaw assertion updated to match.

Add gpu-operator-chart-version pod annotation to the DRA driver in
recipes/components/nvidia-dra-driver-gpu/values.yaml (kubeletPlugin and
controller). gpu-operator's k8s-driver-manager reloads the host NVIDIA
kernel modules during a driver bump but does NOT restart the sibling
nvidia-dra-driver-gpu DaemonSet because its chart template is unchanged.
The DRA kubelet plugin loads libnvidia-ml.so at pod start and pins to
the running driver version, so a kernel-module reload leaves the pod
with a stale NVML handle; CDI spec generation then fails with
Driver/library version mismatch and DRA-allocated workloads stay in
ContainerCreating. Bumping the annotation value on every gpu-operator
chart bump forces a DaemonSet re-roll across every deployer (Helm,
helmfile, Flux, Argo CD), refreshing NVML against the now-running
driver. The deploy.sh Helm-deployer template already restarts the
kubelet plugin post-install; the annotation closes the gap for
GitOps-deployer artifacts. Manual coupling of the annotation value to
the chart version is a known maintenance gap tracked in issue #973
(bundler-derived annotation as the durable fix). Verified live on aicr2:
applying the new bundle rolled both DRA pods cleanly, and
aicr validate --phase performance passed end-to-end (inference-perf
39.7k tok/s, TTFT p99 122ms).

Gate the existing post-install DRA kubelet-plugin restart on
gpu-operator's per-node driver migration completing. The
annotation-based re-roll above and the existing kubectl rollout restart
in deploy.sh both fire at helm-upgrade time, but k8s-driver-manager runs
the per-node module reload asynchronously after
`helm upgrade gpu-operator` returns. On a multi-GPU-node cluster, the
DRA plugin can re-roll on a node whose driver migration has not yet
started, get its NVML handle stuck to the pre-migration state, and
produce "invalid CDI Spec: empty device edits" once the modules reload
underneath it. Reproduced live on a GB200 EKS cluster (yljtrxpmzu)
during PR validation: the chart-version annotation re-rolled the DRA
pods correctly, but the second GB200 node's driver migration ran after
that, leaving its kubelet-plugin pod with a stale NVML view. Adding a
kubectl wait for nvidia.com/gpu-driver-upgrade-state=upgrade-done on
every GPU node before the existing post-install rollout-restart closes
this timing race for the default Helm-deployer flow. See #973 for
broader cross-deployer coverage (a Helm post-install hook on
gpu-operator-post would cover helmfile/Flux/Argo CD too).

BOM (docs/user/container-images.md) regenerated; only the driver image
moves, since the registry-driven BOM was already on v26.3.1. The
BOM-vs-leaf drift class itself is tracked separately in #966.
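
For checking the migration gate by hand, the wait condition described
above can be approximated with a small filter. This is a sketch only and
not part of this patch: pending_nodes is a hypothetical helper name, and
the kubectl invocation shown in the comment assumes cluster access and
that the operator sets the nvidia.com/gpu-driver-upgrade-state label as
described above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this patch): print the name of every
# node whose upgrade-state is anything other than "upgrade-done".
# Input on stdin: one "<node-name> <state>" pair per line. A line with
# no state (label not set) is treated as pending; blank lines are skipped.
pending_nodes() {
  awk 'NF > 0 && $2 != "upgrade-done" { print $1 }'
}

# Possible use against a live cluster (assumes kubectl access):
#   kubectl get nodes -l nvidia.com/gpu.deploy.driver=true \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.labels.nvidia\.com/gpu-driver-upgrade-state}{"\n"}{end}' \
#     | pending_nodes
```

An empty result would suggest every managed node has reached
upgrade-done; any node names printed are the ones the 15m wait in
deploy.sh would still be blocking on.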
---
 docs/user/container-images.md                 |  2 +-
 .../deployer/helm/templates/deploy.sh.tmpl    | 40 +++++++++++++++++++
 .../gpu-operator/values-aks-training.yaml     |  9 +++++
 .../components/gpu-operator/values-aks.yaml   | 13 ++++++
 recipes/components/gpu-operator/values.yaml   | 16 +++++++-
 .../nvidia-dra-driver-gpu/values.yaml         | 18 +++++++++
 recipes/overlays/aks.yaml                     |  2 +-
 recipes/overlays/base.yaml                    |  2 +-
 recipes/overlays/gb200-eks-inference.yaml     |  1 -
 recipes/overlays/gb200-eks-training.yaml      |  4 --
 recipes/overlays/oke.yaml                     |  2 +-
 .../assert-bundle-scheduling.yaml             |  8 +++-
 12 files changed, 104 insertions(+), 13 deletions(-)

diff --git a/docs/user/container-images.md b/docs/user/container-images.md
index ab2dbc9a4..47b7ed57d 100644
--- a/docs/user/container-images.md
+++ b/docs/user/container-images.md
@@ -107,7 +107,7 @@ _No images extracted._
 - `nvcr.io/nvidia/cloud-native/nvidia-fs:2.27.3`
 - `nvcr.io/nvidia/cloud-native/nvidia-sandbox-device-plugin:v0.0.3`
 - `nvcr.io/nvidia/cloud-native/vgpu-device-manager:v0.4.2`
-- `nvcr.io/nvidia/driver:580.105.08`
+- `nvcr.io/nvidia/driver:580.126.20`
 - `nvcr.io/nvidia/gpu-operator:v26.3.1`
 - `nvcr.io/nvidia/k8s-device-plugin:v0.19.0`
 - `nvcr.io/nvidia/k8s/container-toolkit:v1.19.0`
diff --git a/pkg/bundler/deployer/helm/templates/deploy.sh.tmpl b/pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
index 0b85ea11f..be6b03f3d 100644
--- a/pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
+++ b/pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
@@ -346,6 +346,46 @@ for dir in "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/; do
   {{- range .Components }}
   {{- if eq .Name "nvidia-dra-driver-gpu" }}
   if [[ "${name}" == "nvidia-dra-driver-gpu" ]]; then
+    # gpu-operator's k8s-driver-manager reloads NVIDIA kernel modules
+    # asynchronously per-node after `helm upgrade gpu-operator` returns.
+    # If the DRA kubelet plugin pod re-rolls (via the chart's
+    # podAnnotations or the post-install restart below) before those
+    # reloads finish on every managed GPU node, the freshly-started
+    # plugin can pin its NVML handle to the now-stale driver state on
+    # a not-yet-migrated node. CDI spec generation then fails with
+    # "invalid CDI Spec: empty device edits" and DRA-allocated pods
+    # stay in ContainerCreating until the plugin is restarted again.
+    # Wait for the migration to settle on every managed node before
+    # touching the plugin. See issue #973.
+    #
+    # Two gates skip this wait safely:
+    #   1. The gpu-operator nvidia-driver-daemonset is absent — host-
+    #      managed-driver recipes (GKE COS, OKE, Kind, etc.) run
+    #      gpu-operator with driver.enabled=false, so the DaemonSet is
+    #      never created. Look it up by name across all namespaces so
+    #      we discover it whichever namespace the operator runs in
+    #      (e.g. os-talos moves it to "privileged-gpu-operator").
+    #   2. No nodes carry nvidia.com/gpu.deploy.driver=true — the
+    #      operator's reconciler sets that label only on nodes its
+    #      driver DaemonSet selects (this respects
+    #      --accelerated-node-selector). Waiting on every
+    #      gpu.present=true node would block until the 15-min timeout
+    #      for any GPU node the operator deliberately excludes.
+    DRIVER_DS_NS=$(kubectl get daemonset -A -o jsonpath='{.items[?(@.metadata.name=="nvidia-driver-daemonset")].metadata.namespace}' 2>/dev/null | awk '{print $1}')
+    if [[ -z "${DRIVER_DS_NS}" ]]; then
+      echo " gpu-operator nvidia-driver-daemonset not present (host-managed driver); skipping migration wait"
+    else
+      MANAGED_NODES=$(kubectl get nodes -l nvidia.com/gpu.deploy.driver=true -o name 2>/dev/null | wc -l | tr -d ' ')
+      if [[ "${MANAGED_NODES}" -gt 0 ]]; then
+        echo " Waiting for gpu-operator driver migration on ${MANAGED_NODES} managed GPU node(s) to reach upgrade-done (ns=${DRIVER_DS_NS})..."
+        if ! kubectl wait --for=jsonpath='{.metadata.labels.nvidia\.com/gpu-driver-upgrade-state}=upgrade-done' \
+            nodes -l nvidia.com/gpu.deploy.driver=true --timeout=15m; then
+          echo " WARNING: not all managed GPU nodes reached upgrade-done within 15m; proceeding with restart anyway"
+        fi
+      else
+        echo " No nodes labeled nvidia.com/gpu.deploy.driver=true yet; skipping migration wait"
+      fi
+    fi
     # Best-effort mitigation for kubelet DRA plugin registration drift.
     # After uninstall/reinstall, kubelet's fsnotify watcher may not detect new
     # registration sockets. Restarting the plugin DS forces fresh socket creation.
diff --git a/recipes/components/gpu-operator/values-aks-training.yaml b/recipes/components/gpu-operator/values-aks-training.yaml
index 585e77041..21197f9f8 100644
--- a/recipes/components/gpu-operator/values-aks-training.yaml
+++ b/recipes/components/gpu-operator/values-aks-training.yaml
@@ -23,6 +23,15 @@ cdi:
 toolkit:
   enabled: false
 
+# Re-enable GPUDirect RDMA. AKS ships MOFED via network-operator on
+# ND-series InfiniBand nodes; useHostMofed: true binds nvidia_peermem
+# against the host's MOFED kernel modules (see values-aks.yaml for the
+# full rationale).
+driver:
+  rdma:
+    enabled: true
+    useHostMofed: true
+
 validator:
   plugin:
     env:
diff --git a/recipes/components/gpu-operator/values-aks.yaml b/recipes/components/gpu-operator/values-aks.yaml
index aaed6e1d9..a4fd1600c 100644
--- a/recipes/components/gpu-operator/values-aks.yaml
+++ b/recipes/components/gpu-operator/values-aks.yaml
@@ -28,6 +28,19 @@
 nfd:
   enabled: false
 
+# Re-enable GPUDirect RDMA. The AKS overlay deploys network-operator,
+# which installs MOFED kernel modules on the host (ND-series InfiniBand
+# nodes). useHostMofed: true tells gpu-operator's driver container to
+# bind nvidia_peermem against the host's MOFED symbols instead of
+# building its own bundled MOFED — required for the network-operator
+# integration to actually work end-to-end. The global default in
+# components/gpu-operator/values.yaml is off because EFA / non-MOFED
+# fabrics trip v26.3.1's driver-validation gate.
+driver:
+  rdma:
+    enabled: true
+    useHostMofed: true
+
 # The following flags are set in the aks-rdma-infiniband reference configuration
 # but are not required for RDMA functionality. They suppress DaemonSets that
 # serve no purpose on AKS ND-H100 nodes. Uncomment if your deployment needs them.
diff --git a/recipes/components/gpu-operator/values.yaml b/recipes/components/gpu-operator/values.yaml
index 625965e69..67f3a80f7 100644
--- a/recipes/components/gpu-operator/values.yaml
+++ b/recipes/components/gpu-operator/values.yaml
@@ -137,12 +137,19 @@ gfd:
   enabled: true
 
 driver:
-  version: 580.105.08
+  # NVIDIA's recommended driver for the v26.3.1 chart; matches the
+  # GB200+EFA floor so a single global pin covers H100/B200/GB200 EKS.
+  version: 580.126.20
   enabled: true
   useOpenKernelModules: true
   maxParallelUpgrades: 5
   rdma:
-    enabled: true
+    # Default off: nvidia_peermem only loads against Mellanox MOFED
+    # symbols. AWS EFA (EKS p4d/p5/p5e) and Linode have no MOFED, so
+    # peermem fails to load and v26.3.1's stricter driver-validation
+    # init container blocks the rest of the GPU stack. Overlays that
+    # ship MOFED (AKS via network-operator) explicitly re-enable this.
+    enabled: false
 
 devicePlugin:
   env:
@@ -166,3 +173,8 @@ validator:
 # NFD deployed as standalone shared component — disable sub-chart
 nfd:
   enabled: false
+
+# Confidential Compute Manager defaults to enabled in chart v26.3.x; keep
+# it off until AICR has explicit CC-capable hardware support.
+ccManager:
+  enabled: false
diff --git a/recipes/components/nvidia-dra-driver-gpu/values.yaml b/recipes/components/nvidia-dra-driver-gpu/values.yaml
index 7a2c17170..117b11962 100644
--- a/recipes/components/nvidia-dra-driver-gpu/values.yaml
+++ b/recipes/components/nvidia-dra-driver-gpu/values.yaml
@@ -56,7 +56,25 @@ resources:
   gpus:
     enabled: true
 
+# gpu-operator-chart-version annotation forces a DaemonSet re-roll when
+# the gpu-operator chart (and its managed driver) bumps. The DRA
+# kubelet plugin loads libnvidia-ml.so at pod start and pins to the
+# driver version running at that moment; gpu-operator's k8s-driver-manager
+# reloads the host kernel modules during a driver bump but does NOT
+# restart the sibling DRA DaemonSet (its chart template hasn't changed),
+# leaving the kubelet plugin's NVML handle stale. CDI spec generation
+# then fails with "Driver/library version mismatch" and DRA-allocated
+# pods stay in ContainerCreating.
+#
+# Bumping this annotation value on every gpu-operator chart bump (here:
+# v26.3.1) changes the rendered pod template and forces helm upgrade to
+# roll the DaemonSet, picking up a fresh NVML handle against the
+# now-running driver. Track follow-up to automate this in #973.
 controller:
   priorityClassName: ""
+  podAnnotations:
+    aicr.nvidia.com/gpu-operator-chart-version: v26.3.1
 kubeletPlugin:
   priorityClassName: ""
+  podAnnotations:
+    aicr.nvidia.com/gpu-operator-chart-version: v26.3.1
diff --git a/recipes/overlays/aks.yaml b/recipes/overlays/aks.yaml
index 0a113404a..b8dfee375 100644
--- a/recipes/overlays/aks.yaml
+++ b/recipes/overlays/aks.yaml
@@ -42,7 +42,7 @@ spec:
     # AKS pre-installs NVIDIA container toolkit; disable toolkit installation
     - name: gpu-operator
       type: Helm
-      version: "v26.3.0"
+      version: "v26.3.1"
       valuesFile: components/gpu-operator/values-aks.yaml
       dependencyRefs:
         - network-operator
diff --git a/recipes/overlays/base.yaml b/recipes/overlays/base.yaml
index e16d0bf10..88a60ba1c 100644
--- a/recipes/overlays/base.yaml
+++ b/recipes/overlays/base.yaml
@@ -40,7 +40,7 @@ spec:
     - name: gpu-operator
       type: Helm
       source: https://helm.ngc.nvidia.com/nvidia
-      version: v25.10.1
+      version: v26.3.1
       valuesFile: components/gpu-operator/values.yaml
       dependencyRefs:
         - nfd
diff --git a/recipes/overlays/gb200-eks-inference.yaml b/recipes/overlays/gb200-eks-inference.yaml
index 2027beed2..7a9e0a2c9 100644
--- a/recipes/overlays/gb200-eks-inference.yaml
+++ b/recipes/overlays/gb200-eks-inference.yaml
@@ -56,7 +56,6 @@ spec:
           gdrcopy:
             enabled: true
           driver:
-            version: 580.126.20
             kernelModuleConfig:
               name: nvidia-kernel-module-params
 
diff --git a/recipes/overlays/gb200-eks-training.yaml b/recipes/overlays/gb200-eks-training.yaml
index a11d903d9..b084b3e9d 100644
--- a/recipes/overlays/gb200-eks-training.yaml
+++ b/recipes/overlays/gb200-eks-training.yaml
@@ -61,10 +61,6 @@ spec:
           gdrcopy:
             enabled: true
           driver:
-            # 580.126.20 is NVIDIA's recommended floor for GB200+EFA; the global
-            # default (580.105.08 in components/gpu-operator/values.yaml) stays
-            # unchanged for H100/B200 and non-EKS GB200 recipes.
-            version: 580.126.20
             kernelModuleConfig:
               name: nvidia-kernel-module-params
 
diff --git a/recipes/overlays/oke.yaml b/recipes/overlays/oke.yaml
index f87df7a62..7f381a25c 100644
--- a/recipes/overlays/oke.yaml
+++ b/recipes/overlays/oke.yaml
@@ -40,7 +40,7 @@ spec:
     # (BM.GPU.B200, BM.GPU.H100, etc.). Disable both to avoid conflicts.
     - name: gpu-operator
       type: Helm
-      version: v26.3.0
+      version: v26.3.1
       valuesFile: components/gpu-operator/values-oke.yaml
 
     # Prometheus persistent storage (provide --storage-class at bundle time, e.g. oci-bv)
diff --git a/tests/chainsaw/cli/cuj1-training/assert-bundle-scheduling.yaml b/tests/chainsaw/cli/cuj1-training/assert-bundle-scheduling.yaml
index 6e4477f30..de1772efe 100644
--- a/tests/chainsaw/cli/cuj1-training/assert-bundle-scheduling.yaml
+++ b/tests/chainsaw/cli/cuj1-training/assert-bundle-scheduling.yaml
@@ -37,11 +37,15 @@ daemonsets:
         value: present
         effect: NoSchedule
 
-# ── Driver: RDMA required for multi-node training ────────────────────
+# ── Driver: nvidia_peermem off on EKS (AWS EFA path uses aws-ofi-nccl)
+# nvidia_peermem only loads against Mellanox MOFED symbols; on AWS EFA
+# (p4d/p5/p5e) it fails to load and v26.3.1's strict driver-validation
+# init container blocks the rest of the GPU stack. NCCL multi-node on
+# EFA uses libfabric via aws-ofi-nccl, not nvidia_peermem.
 driver:
   enabled: true
   rdma:
-    enabled: true
+    enabled: false
   useOpenKernelModules: true
 
 # ── GDRCopy: GPU-direct memory for high-performance training ─────────