[Bug]: nvkind gpu-operator values set unsupported devicePlugin.affinity, allowing device plugin to schedule on control-plane

### Prerequisites

- [x] I searched existing issues and found no duplicates
- [x] I can reproduce this issue consistently
- [x] This is not a security vulnerability

### Bug Description

AICR's `kind` recipe overlay attempts to keep GPU Operator operands off kind control-plane nodes by setting per-component affinity under `dcgmExporter`, `devicePlugin`, `gfd`, and `validator`.

The intent is documented in `recipes/overlays/kind.yaml`:

```yaml
# Exclude control-plane nodes — NFD labels them with GPU features
# (shared host kernel) but they lack actual device access, causing
# operand init containers to hang and ClusterPolicy to stay notReady.
dcgmExporter:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: DoesNotExist

# Use chart-default device plugin env in kind (no CDI overrides)
devicePlugin:
  env: []
  # Exclude control-plane nodes (see dcgmExporter.affinity comment)
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: DoesNotExist
```

However, for the GPU Operator version currently pinned by `recipes/overlays/base.yaml` (`v25.10.1`), these values are not rendered into `ClusterPolicy` and are not understood by the GPU Operator CRD/controller. The rendered `ClusterPolicy.spec.devicePlugin` only contains fields such as `enabled`, `repository`, `image`, `version`, `imagePullPolicy`, etc.; it does not contain `affinity`.

As a result, the generated `nvidia-device-plugin-daemonset` is based on GPU Operator's built-in operand template and effectively schedules using only:

```yaml
nodeSelector:
  nvidia.com/gpu.deploy.device-plugin: "true"
tolerations:
  - operator: Exists
```

In nvkind/kind, the control-plane container can see NVIDIA PCI devices because it is started with the NVIDIA runtime / shared host kernel view. NFD then labels the control-plane node with a NVIDIA PCI feature label, GPU Operator adds `nvidia.com/gpu.deploy.device-plugin=true`, and the DaemonSet can schedule there because `daemonsets.tolerations: [{operator: Exists}]` tolerates the control-plane `NoSchedule` taint.

### Impact

Medium. This can cause GPU Operator operands, especially `nvidia-device-plugin-daemonset`, to schedule onto kind control-plane nodes even though the AICR recipe appears to exclude them. In affected nvkind clusters this can leave operands stuck in init or keep the `ClusterPolicy` from becoming ready.

### Component

Recipe engine / data

### Regression?

Unknown / first time using this feature

### Steps to Reproduce

1. Generate/deploy a kind/nvkind AICR bundle that inherits `recipes/overlays/kind.yaml` and `recipes/overlays/base.yaml`.
2. Inspect the generated GPU Operator values for the kind overlay. They include `devicePlugin.affinity` that excludes `node-role.kubernetes.io/control-plane`.
3. Run `helm template` for the generated GPU Operator bundle:

```bash
helm template gpu-operator gpu-operator \
  --repo https://helm.ngc.nvidia.com/nvidia \
  --version v25.10.1 \
  --namespace gpu-operator \
  -f values.yaml -f cluster-values.yaml
```

4. Observe that the rendered `ClusterPolicy.spec.devicePlugin` does not include `affinity`.
5. Deploy the bundle in an nvkind cluster where the control-plane node is labeled by NFD with a NVIDIA PCI feature label.
6. Observe that GPU Operator may add `nvidia.com/gpu.deploy.device-plugin=true` to the control-plane node, allowing the generated device-plugin DaemonSet to schedule there unless it is manually patched.

### Expected Behavior

The AICR kind/nvkind recipe should reliably prevent GPU Operator operand DaemonSets from scheduling on control-plane nodes, or it should not emit unsupported values that imply this protection exists.

Specifically, `nvidia-device-plugin-daemonset` should not run on nodes with:

```text
node-role.kubernetes.io/control-plane
```

when using the AICR kind overlay.

### Actual Behavior

AICR emits `devicePlugin.affinity`, `dcgmExporter.affinity`, `gfd.affinity`, and `validator.affinity` under GPU Operator values, but GPU Operator `v25.10.1` does not render or consume these fields. The generated GPU Operator-managed device-plugin DaemonSet can still schedule on control-plane nodes when they are labeled with:

```text
nvidia.com/gpu.deploy.device-plugin=true
```

### Environment

- AICR version: current repository state
- Install method: bundle generated from AICR recipes
- Platform: kind / nvkind
- Kubernetes version: observed with kind node image `kindest/node:v1.35.0`
- OS: Ubuntu 22.04 host, Debian 12 kind node image
- Kernel version: 5.15.0-177-generic
- GPU type: observed on a Tesla P4 nvkind host; issue is recipe/chart-version behavior rather than GPU-model specific
- Workload intent: kind training/inference recipes

### Command / Request Used

```bash
helm template gpu-operator gpu-operator \
  --repo https://helm.ngc.nvidia.com/nvidia \
  --version v25.10.1 \
  --namespace gpu-operator \
  -f values.yaml -f cluster-values.yaml
```

### Logs / Error Output

Example rendered `ClusterPolicy.spec.devicePlugin` from GPU Operator `v25.10.1`:

```yaml
devicePlugin:
  enabled: true
  repository: nvcr.io/nvidia
  image: k8s-device-plugin
  version: "v0.18.1"
  imagePullPolicy: IfNotPresent
```

No `affinity` field is present.

Example control-plane labels observed in an nvkind cluster:

```text
feature.node.kubernetes.io/pci-0302_10de.present=true
nvidia.com/gpu.present=true
nvidia.com/gpu.deploy.device-plugin=true
```

### Additional Context

This appears to be a mismatch between AICR's kind overlay and the GPU Operator chart/CRD API. The `devicePlugin.affinity` style resembles the standalone `NVIDIA/k8s-device-plugin` Helm chart, which supports top-level `.Values.affinity`, but GPU Operator does not directly use that chart for its managed `nvidia-device-plugin-daemonset`.

current `v25.10.1` and upstream `main` do not appear to expose `DevicePluginSpec.Affinity` in `ClusterPolicy`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug]: nvkind gpu-operator values set unsupported devicePlugin.affinity, allowing device plugin to schedule on control-plane #982

Prerequisites

Bug Description

Impact

Component

Regression?

Steps to Reproduce

Expected Behavior

Actual Behavior

Environment

Command / Request Used

Logs / Error Output

Additional Context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[Bug]: nvkind gpu-operator values set unsupported devicePlugin.affinity, allowing device plugin to schedule on control-plane #982

Description

Prerequisites

Bug Description

Impact

Component

Regression?

Steps to Reproduce

Expected Behavior

Actual Behavior

Environment

Command / Request Used

Logs / Error Output

Additional Context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions