Prerequisites
Bug Description
AICR's kind recipe overlay attempts to keep GPU Operator operands off kind control-plane nodes by setting per-component affinity under dcgmExporter, devicePlugin, gfd, and validator.
The intent is documented in recipes/overlays/kind.yaml:
# Exclude control-plane nodes — NFD labels them with GPU features
# (shared host kernel) but they lack actual device access, causing
# operand init containers to hang and ClusterPolicy to stay notReady.
dcgmExporter:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: DoesNotExist
# Use chart-default device plugin env in kind (no CDI overrides)
devicePlugin:
  env: []
  # Exclude control-plane nodes (see dcgmExporter.affinity comment)
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: DoesNotExist
However, for the GPU Operator version currently pinned by recipes/overlays/base.yaml (v25.10.1), these values are not rendered into ClusterPolicy and are not understood by the GPU Operator CRD/controller. The rendered ClusterPolicy.spec.devicePlugin only contains fields such as enabled, repository, image, version, imagePullPolicy, etc.; it does not contain affinity.
As a result, the generated nvidia-device-plugin-daemonset is based on GPU Operator's built-in operand template and effectively schedules using only:
nodeSelector:
  nvidia.com/gpu.deploy.device-plugin: "true"
tolerations:
  - operator: Exists
In nvkind/kind, the control-plane container can see NVIDIA PCI devices because it is started with the NVIDIA runtime / shared host kernel view. NFD then labels the control-plane node with an NVIDIA PCI feature label, GPU Operator adds nvidia.com/gpu.deploy.device-plugin=true, and the DaemonSet can schedule there because daemonsets.tolerations: [{operator: Exists}] tolerates the control-plane NoSchedule taint.
Impact
Medium. This can cause GPU Operator operands, especially nvidia-device-plugin-daemonset, to schedule onto kind control-plane nodes even though the AICR recipe appears to exclude them. In affected nvkind clusters this can leave operands stuck in init or keep the ClusterPolicy from becoming ready.
Component
Recipe engine / data
Regression?
Unknown / first time using this feature
Steps to Reproduce
- Generate/deploy a kind/nvkind AICR bundle that inherits recipes/overlays/kind.yaml and recipes/overlays/base.yaml.
- Inspect the generated GPU Operator values for the kind overlay. They include devicePlugin.affinity that excludes node-role.kubernetes.io/control-plane.
- Run helm template for the generated GPU Operator bundle:
  helm template gpu-operator gpu-operator \
    --repo https://helm.ngc.nvidia.com/nvidia \
    --version v25.10.1 \
    --namespace gpu-operator \
    -f values.yaml -f cluster-values.yaml
- Observe that the rendered ClusterPolicy.spec.devicePlugin does not include affinity.
- Deploy the bundle in an nvkind cluster where the control-plane node is labeled by NFD with an NVIDIA PCI feature label.
- Observe that GPU Operator may add nvidia.com/gpu.deploy.device-plugin=true to the control-plane node, allowing the generated device-plugin DaemonSet to schedule there unless it is manually patched.
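The comparison between the generated values and the rendered ClusterPolicy in the steps above can be scripted. The following is a minimal sketch, not part of AICR or GPU Operator: the helper names device_plugin_block and device_plugin_has_affinity are hypothetical, and the plain text scan assumes block-style YAML in which the first devicePlugin: key on stdin is the one of interest. A YAML-aware tool would be more robust.

```shell
# Hypothetical helpers (not part of AICR or GPU Operator).
# Assumes block-style YAML on stdin and inspects only the first
# `devicePlugin:` key it encounters.

# Print the lines nested under the first `devicePlugin:` key.
device_plugin_block() {
  awk '
    NF == 0 { next }                           # skip blank lines
    $1 ~ /^#/ { next }                         # skip comment-only lines
    { indent = index($0, $1) - 1 }             # count of leading characters
    !found && NF == 1 && $1 == "devicePlugin:" { found = 1; base = indent; next }
    found && !done {
      if (indent <= base) { done = 1; next }   # indentation dropped: block ended
      print
    }
  '
}

# Exit 0 if that block contains an `affinity:` key, non-zero otherwise.
device_plugin_has_affinity() {
  device_plugin_block | grep '^[[:space:]]*affinity:' > /dev/null
}

# Example (file name taken from the helm command above):
#   device_plugin_has_affinity < values.yaml
```

Based on the behavior described in this report, the check would be expected to succeed on the generated values that carry devicePlugin.affinity and to fail when the helm template output is piped into it.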
Expected Behavior
The AICR kind/nvkind recipe should reliably prevent GPU Operator operand DaemonSets from scheduling on control-plane nodes, or it should not emit unsupported values that imply this protection exists.
Specifically, when using the AICR kind overlay, nvidia-device-plugin-daemonset should not run on nodes that carry the label:
node-role.kubernetes.io/control-plane
Actual Behavior
AICR emits devicePlugin.affinity, dcgmExporter.affinity, gfd.affinity, and validator.affinity under GPU Operator values, but GPU Operator v25.10.1 does not render or consume these fields. The generated GPU Operator-managed device-plugin DaemonSet can still schedule on control-plane nodes when they are labeled with:
nvidia.com/gpu.deploy.device-plugin=true
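To confirm whether an operand pod actually landed on a control-plane node, a pod-to-node listing can be filtered. This is a sketch under stated assumptions: the function name pods_on_control_plane is hypothetical, the input is two whitespace-separated columns (pod name, node name), and control-plane nodes are recognized by kind's default -control-plane node-name suffix rather than by label.

```shell
# Hypothetical helper. Reads "POD NODE" pairs on stdin and prints the
# pairs whose node name ends with kind's default control-plane suffix
# (for example "kind-control-plane" or "kind-control-plane2").
pods_on_control_plane() {
  awk '$2 ~ /-control-plane[0-9]*$/ { print $1, $2 }'
}

# Example (requires cluster access; namespace taken from the helm command
# in this report):
#   kubectl get pods -n gpu-operator --no-headers \
#     -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName \
#     | pods_on_control_plane
```

Empty output would suggest that no pod in the namespace is running on a control-plane node. Clusters whose node names do not follow the kind convention would need a label-based check instead.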
Environment
- AICR version: current repository state
- Install method: bundle generated from AICR recipes
- Platform: kind / nvkind
- Kubernetes version: observed with kind node image kindest/node:v1.35.0
- OS: Ubuntu 22.04 host, Debian 12 kind node image
- Kernel version: 5.15.0-177-generic
- GPU type: observed on a Tesla P4 nvkind host; issue is recipe/chart-version behavior rather than GPU-model specific
- Workload intent: kind training/inference recipes
Command / Request Used
helm template gpu-operator gpu-operator \
  --repo https://helm.ngc.nvidia.com/nvidia \
  --version v25.10.1 \
  --namespace gpu-operator \
  -f values.yaml -f cluster-values.yaml
Logs / Error Output
Example rendered ClusterPolicy.spec.devicePlugin from GPU Operator v25.10.1:
devicePlugin:
  enabled: true
  repository: nvcr.io/nvidia
  image: k8s-device-plugin
  version: "v0.18.1"
  imagePullPolicy: IfNotPresent
No affinity field is present.
Example control-plane labels observed in an nvkind cluster:
feature.node.kubernetes.io/pci-0302_10de.present=true
nvidia.com/gpu.present=true
nvidia.com/gpu.deploy.device-plugin=true
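The label combination above can be detected across a cluster with a small filter. This is an illustrative sketch only: the function name labeled_control_plane_nodes is hypothetical, and it assumes input shaped like the output of kubectl get nodes --show-labels --no-headers, with the node name in the first column and the comma-separated labels in the last.

```shell
# Hypothetical helper. Reads node listing lines on stdin (node name first,
# comma-separated labels last) and prints the names of control-plane nodes
# that also carry the device-plugin deploy label.
labeled_control_plane_nodes() {
  awk '
    {
      labels = "," $NF ","   # pad so every label is delimited by commas
      if (index(labels, ",node-role.kubernetes.io/control-plane=") &&
          index(labels, ",nvidia.com/gpu.deploy.device-plugin=true,"))
        print $1
    }
  '
}

# Example (requires cluster access):
#   kubectl get nodes --show-labels --no-headers | labeled_control_plane_nodes
```

Any node name printed is a control-plane node that the device-plugin DaemonSet's nodeSelector would match.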
Additional Context
This appears to be a mismatch between AICR's kind overlay and the GPU Operator chart/CRD API. The devicePlugin.affinity style resembles the standalone NVIDIA/k8s-device-plugin Helm chart, which supports top-level .Values.affinity, but GPU Operator does not directly use that chart for its managed nvidia-device-plugin-daemonset.
Neither v25.10.1 nor upstream main appears to expose DevicePluginSpec.Affinity in ClusterPolicy.