[Bug]: nvkind gpu-operator values set unsupported devicePlugin.affinity, allowing device plugin to schedule on control-plane #982

Description

@carlory

Prerequisites

  • I searched existing issues and found no duplicates
  • I can reproduce this issue consistently
  • This is not a security vulnerability

Bug Description

AICR's kind recipe overlay attempts to keep GPU Operator operands off kind control-plane nodes by setting per-component affinity under dcgmExporter, devicePlugin, gfd, and validator.

The intent is documented in recipes/overlays/kind.yaml:

# Exclude control-plane nodes — NFD labels them with GPU features
# (shared host kernel) but they lack actual device access, causing
# operand init containers to hang and ClusterPolicy to stay notReady.
dcgmExporter:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: DoesNotExist

# Use chart-default device plugin env in kind (no CDI overrides)
devicePlugin:
  env: []
  # Exclude control-plane nodes (see dcgmExporter.affinity comment)
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: DoesNotExist

However, for the GPU Operator version currently pinned by recipes/overlays/base.yaml (v25.10.1), these values are not rendered into ClusterPolicy and are not understood by the GPU Operator CRD/controller. The rendered ClusterPolicy.spec.devicePlugin only contains fields such as enabled, repository, image, version, imagePullPolicy, etc.; it does not contain affinity.

As a result, the generated nvidia-device-plugin-daemonset is based on GPU Operator's built-in operand template and effectively schedules using only:

nodeSelector:
  nvidia.com/gpu.deploy.device-plugin: "true"
tolerations:
  - operator: Exists

In nvkind/kind, the control-plane container can see NVIDIA PCI devices because it is started with the NVIDIA runtime and shares the host kernel view. NFD then labels the control-plane node with an NVIDIA PCI feature label, GPU Operator adds nvidia.com/gpu.deploy.device-plugin=true, and the DaemonSet can schedule there because daemonsets.tolerations: [{operator: Exists}] tolerates the control-plane NoSchedule taint.
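The effective scheduling rule described above can be modelled in a few lines of shell. This is only a sketch: the function names device_plugin_can_schedule and is_control_plane are hypothetical, the label string is assumed to be the comma-separated form printed by kubectl get nodes --show-labels, and the node-role label in the sample is assumed to be present on the kind control-plane node.

```shell
# Model of the effective rule: the DaemonSet nodeSelector requires
# nvidia.com/gpu.deploy.device-plugin=true, and the blanket toleration
# ({operator: Exists}) means no taint ever excludes a node.
# $1: comma-separated node labels.
device_plugin_can_schedule() {
  case ",$1," in
    *,nvidia.com/gpu.deploy.device-plugin=true,*) return 0 ;;
    *) return 1 ;;
  esac
}

# True when the label set carries the control-plane role label (any value).
is_control_plane() {
  case ",$1," in
    *,node-role.kubernetes.io/control-plane=*) return 0 ;;
    *) return 1 ;;
  esac
}

# Sample label set modelled on the control-plane labels reported in this issue.
labels='feature.node.kubernetes.io/pci-0302_10de.present=true,node-role.kubernetes.io/control-plane=,nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/gpu.present=true'

if is_control_plane "$labels" && device_plugin_can_schedule "$labels"; then
  echo "control-plane node is eligible for the device plugin"
fi
```

Because the only remaining gate is the nvidia.com/gpu.deploy.device-plugin label, any affinity that never reaches the DaemonSet spec has no effect on this outcome.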

Impact

Medium. This can cause GPU Operator operands, especially nvidia-device-plugin-daemonset, to schedule onto kind control-plane nodes even though the AICR recipe appears to exclude them. In affected nvkind clusters this can leave operands stuck in init or keep the ClusterPolicy from becoming ready.

Component

Recipe engine / data

Regression?

Unknown / first time using this feature

Steps to Reproduce

  1. Generate/deploy a kind/nvkind AICR bundle that inherits recipes/overlays/kind.yaml and recipes/overlays/base.yaml.
  2. Inspect the generated GPU Operator values for the kind overlay. They include devicePlugin.affinity that excludes node-role.kubernetes.io/control-plane.
  3. Run helm template for the generated GPU Operator bundle:
helm template gpu-operator gpu-operator \
  --repo https://helm.ngc.nvidia.com/nvidia \
  --version v25.10.1 \
  --namespace gpu-operator \
  -f values.yaml -f cluster-values.yaml
  4. Observe that the rendered ClusterPolicy.spec.devicePlugin does not include affinity.
  5. Deploy the bundle in an nvkind cluster where the control-plane node is labeled by NFD with an NVIDIA PCI feature label.
  6. Observe that GPU Operator may add nvidia.com/gpu.deploy.device-plugin=true to the control-plane node, allowing the generated device-plugin DaemonSet to schedule there unless it is manually patched.
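The check on the rendered output can be scripted rather than done by eye. The following is a minimal sketch, assuming the rendered manifests are plain space-indented YAML; the function name has_device_plugin_affinity is hypothetical and not part of AICR or GPU Operator.

```shell
# Reads rendered manifests on stdin. Exits 0 if an `affinity:` key appears
# at any depth under a `devicePlugin:` block, 1 otherwise.
has_device_plugin_affinity() {
  awk '
    # Start of a devicePlugin block: remember its indentation.
    /^ *devicePlugin: *$/ {
      block_indent = match($0, /[^ ]/) - 1
      in_block = 1
      next
    }
    # Inside the block: leave it once indentation drops back,
    # otherwise look for an affinity key.
    in_block && NF > 0 {
      indent = match($0, /[^ ]/) - 1
      if (indent <= block_indent) in_block = 0
      else if ($1 == "affinity:") found = 1
    }
    END { exit(found ? 0 : 1) }
  '
}

# Possible usage with the helm command from the steps above:
# helm template gpu-operator gpu-operator \
#   --repo https://helm.ngc.nvidia.com/nvidia --version v25.10.1 \
#   --namespace gpu-operator -f values.yaml -f cluster-values.yaml \
#   | has_device_plugin_affinity || echo "devicePlugin.affinity was not rendered"
```

A text scan like this is deliberately crude; a YAML-aware tool would be more robust, but the scan is enough to confirm whether the value survives rendering.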

Expected Behavior

The AICR kind/nvkind recipe should reliably prevent GPU Operator operand DaemonSets from scheduling on control-plane nodes, or it should not emit unsupported values that imply this protection exists.

Specifically, nvidia-device-plugin-daemonset should not run on nodes with:

node-role.kubernetes.io/control-plane

when using the AICR kind overlay.

Actual Behavior

AICR emits devicePlugin.affinity, dcgmExporter.affinity, gfd.affinity, and validator.affinity under GPU Operator values, but GPU Operator v25.10.1 does not render or consume these fields. The generated GPU Operator-managed device-plugin DaemonSet can still schedule on control-plane nodes when they are labeled with:

nvidia.com/gpu.deploy.device-plugin=true

Environment

  • AICR version: current repository state
  • Install method: bundle generated from AICR recipes
  • Platform: kind / nvkind
  • Kubernetes version: observed with kind node image kindest/node:v1.35.0
  • OS: Ubuntu 22.04 host, Debian 12 kind node image
  • Kernel version: 5.15.0-177-generic
  • GPU type: observed on a Tesla P4 nvkind host; issue is recipe/chart-version behavior rather than GPU-model specific
  • Workload intent: kind training/inference recipes

Command / Request Used

helm template gpu-operator gpu-operator \
  --repo https://helm.ngc.nvidia.com/nvidia \
  --version v25.10.1 \
  --namespace gpu-operator \
  -f values.yaml -f cluster-values.yaml

Logs / Error Output

Example rendered ClusterPolicy.spec.devicePlugin from GPU Operator v25.10.1:

devicePlugin:
  enabled: true
  repository: nvcr.io/nvidia
  image: k8s-device-plugin
  version: "v0.18.1"
  imagePullPolicy: IfNotPresent

No affinity field is present.

Example control-plane labels observed in an nvkind cluster:

feature.node.kubernetes.io/pci-0302_10de.present=true
nvidia.com/gpu.present=true
nvidia.com/gpu.deploy.device-plugin=true

Additional Context

This appears to be a mismatch between AICR's kind overlay and the GPU Operator chart/CRD API. The devicePlugin.affinity style resembles the standalone NVIDIA/k8s-device-plugin Helm chart, which supports top-level .Values.affinity, but GPU Operator does not directly use that chart for its managed nvidia-device-plugin-daemonset.

Neither the currently pinned v25.10.1 nor upstream main appears to expose DevicePluginSpec.Affinity in ClusterPolicy.
