chore(recipes): make DRA-rollout trigger durable (don't rely on manual annotation bump)

## Summary

PR #965 mitigates the stale-NVML class of bug by hard-coding the gpu-operator chart version into a DRA pod-template annotation. The annotation works as long as it's kept in lockstep with the gpu-operator chart pin — which depends on a maintainer remembering to bump both in the same PR.

That's brittle. A future gpu-operator bump that forgets the annotation will silently re-introduce the original failure (stale NVML, DRA `FailedPrepareDynamicResources`, GPU workloads stuck in `ContainerCreating`).

This issue tracks the **durable** fix.

## Pre-PR-#965 repro (context)

After upgrading gpu-operator from v25.10.1 (driver `580.105.08`) → v26.3.1 (driver `580.126.20`), the `nvidia-dra-driver-gpu` kubelet-plugin DaemonSet pod is not restarted by `helm upgrade`. Its in-pod NVML handle remains bound to the old driver, and DRA allocation fails with:

```
FailedPrepareDynamicResources: Failed to prepare dynamic resources:
  NodePrepareResources failed for ResourceClaim ...:
  unable to create CDI spec file for claim: ...:
  failed to initialize NVML: Driver/library version mismatch
```

Workloads that request GPUs via DRA (Dynamo `DynamoGraphDeployment`, `aicr validate --phase performance` `inference-perf`, etc.) stay in `ContainerCreating` indefinitely. Manual workaround pre-#965: `kubectl delete pod -n nvidia-dra-driver -l app.kubernetes.io/name=nvidia-dra-driver-gpu`.

## What's already in place (don't reinvent)

`deploy.sh.tmpl` ([pkg/bundler/deployer/helm/templates/deploy.sh.tmpl:354](https://github.com/NVIDIA/aicr/blob/main/pkg/bundler/deployer/helm/templates/deploy.sh.tmpl#L354)) already emits a post-install `kubectl rollout restart` for the DRA kubelet-plugin DaemonSet, gated on `name == "nvidia-dra-driver-gpu"`. So:

- ✅ Users running the bundled `./deploy.sh` (**default Helm deployer**) get the restart automatically.
- ❌ Users consuming `helmfile`, `flux`, `argocd`, or `argocd-helm` artifacts get no equivalent restart: those deployer templates contain no such hook (`grep -rn "kubelet-plugin" pkg/bundler/deployer/{helmfile,flux,argocd*}/` returns zero matches).
- ❌ Users running `helm upgrade` directly on the upstream chart without going through AICR — out of scope for AICR; only an upstream chart change can cover them.

(`localformat` is the shared folder writer that all deployers compose, not a user-facing deployer itself.)

PR #965's annotation closes the GitOps-deployer gap by making the pod-template change visible in every deployer's rendered artifact — they all re-roll the DaemonSet on next reconcile. The durability question is how to keep the annotation correct across future chart bumps.

## Why PR #965's mitigation isn't durable

```yaml
# recipes/components/nvidia-dra-driver-gpu/values.yaml
kubeletPlugin:
  podAnnotations:
    aicr.nvidia.com/gpu-operator-chart-version: v26.3.1   # ← must be bumped manually
```

- The annotation value is a free-form string. Helm has no awareness it's supposed to mirror `gpu-operator`'s chart version.
- A future PR that bumps `gpu-operator` (e.g. to v26.4.0) but forgets the annotation leaves the rendered `nvidia-dra-driver-gpu` DaemonSet manifest identical to the previous release. With no pod-template change, `helm upgrade` does not roll the pods, and stale NVML returns.
- `make qualify` does not catch this — no static or runtime check verifies the two pins are coherent.
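
The failure mode above can be illustrated with a small sketch. It assumes, as a simplification, that a roll happens only when the rendered pod template changes, and it uses a SHA-256 digest of the template as a stand-in for that comparison. `renderPodTemplate`, `templateDigest`, and the container name are hypothetical and exist only for this illustration; they are not part of AICR or the chart.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// renderPodTemplate is a hypothetical stand-in for the DRA chart's rendered
// pod template. Only the annotation value varies between renders; nothing in
// the template depends on the gpu-operator chart pin itself.
func renderPodTemplate(annotation string) map[string]any {
	return map[string]any{
		"metadata": map[string]any{
			"annotations": map[string]any{
				"aicr.nvidia.com/gpu-operator-chart-version": annotation,
			},
		},
		"spec": map[string]any{
			// Container name is illustrative only.
			"containers": []any{map[string]any{"name": "kubelet-plugin"}},
		},
	}
}

// templateDigest hashes the JSON form of the template. json.Marshal sorts map
// keys, so equal templates always produce equal digests.
func templateDigest(tmpl map[string]any) string {
	raw, err := json.Marshal(tmpl)
	if err != nil {
		panic(err)
	}
	return fmt.Sprintf("%x", sha256.Sum256(raw))
}

func main() {
	before := templateDigest(renderPodTemplate("v26.3.1"))
	// gpu-operator was bumped, but the annotation was forgotten.
	forgot := templateDigest(renderPodTemplate("v26.3.1"))
	// gpu-operator was bumped and the annotation was bumped with it.
	bumped := templateDigest(renderPodTemplate("v26.4.0"))

	fmt.Println("forgotten bump changes template:", before != forgot)
	fmt.Println("annotation bump changes template:", before != bumped)
}
```

In this model the forgotten bump is indistinguishable from no change at all, which is why nothing in the deploy path can notice it.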

## Recommended durable fix: bundler-derived annotation

**Inject the annotation at bundle-generation time from the resolved recipe**, not at hand-edit time.

Concrete shape:

- **Source**: read the `gpu-operator` `ComponentRef.Version` from the resolved recipe (the same value the resolver writes into the bundle for the gpu-operator chart).
- **Destination**: inject into both
  - `controller.podAnnotations.aicr.nvidia.com/gpu-operator-chart-version`
  - `kubeletPlugin.podAnnotations.aicr.nvidia.com/gpu-operator-chart-version`
  of the `nvidia-dra-driver-gpu` rendered values.
- **Trigger condition**: only inject when **both** `gpu-operator` and `nvidia-dra-driver-gpu` are *enabled* in the **filtered resolved recipe**. `DefaultBundler.Make` ([`pkg/bundler/bundler.go:213-254`](https://github.com/NVIDIA/aicr/blob/main/pkg/bundler/bundler.go#L213)) builds an `enabledRefs` slice that excludes disabled components before any value extraction; the helper should consume that filtered `recipeResult.ComponentRefs`, not the unfiltered recipe.
- **Injection point**: in `DefaultBundler.Make`, **after** `extractComponentValues` ([line 277](https://github.com/NVIDIA/aicr/blob/main/pkg/bundler/bundler.go#L277)) has loaded recipe values, applied user `--set` overrides, and applied scheduling overrides — but **before** `buildDeployer` ([line 311](https://github.com/NVIDIA/aicr/blob/main/pkg/bundler/bundler.go#L311)). This window guarantees:
  - every deployer (Helm, helmfile, Flux, Argo CD, argocd-helm) receives the same final `componentValues` map;
  - a user `--set gpuoperator:rollout-trigger=foo` does **not** accidentally defeat the internal rollout annotation (unless AICR intentionally decides to allow that, which would be a separate explicit knob).
- **Maintainer-facing values file** in `recipes/components/nvidia-dra-driver-gpu/values.yaml` then carries no chart-version string at all — that data lives only in the resolved recipe.
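
A minimal sketch of the helper described above, under stated assumptions. `ComponentRef` is reduced here to two fields, `componentValues` is assumed to be a `map[string]map[string]any` keyed by component name, and `injectDRARolloutAnnotation` and `childMap` are hypothetical names; the real types and signatures in `pkg/bundler` may differ.

```go
package main

import "fmt"

// ComponentRef is a simplified, hypothetical stand-in for the recipe type;
// only the two fields this sketch needs are modelled.
type ComponentRef struct {
	Name    string
	Version string
}

const (
	gpuOperatorName = "gpu-operator"
	draDriverName   = "nvidia-dra-driver-gpu"
	annotationKey   = "aicr.nvidia.com/gpu-operator-chart-version"
)

// childMap returns parent[key] as a map, creating it when it is absent.
// A non-map value at that key is replaced by an empty map.
func childMap(parent map[string]any, key string) map[string]any {
	if m, ok := parent[key].(map[string]any); ok {
		return m
	}
	m := map[string]any{}
	parent[key] = m
	return m
}

// injectDRARolloutAnnotation (hypothetical name) writes the gpu-operator chart
// version into the DRA driver's controller and kubeletPlugin podAnnotations.
// enabledRefs must be the filtered list of enabled components. The function
// reports whether it injected anything.
func injectDRARolloutAnnotation(enabledRefs []ComponentRef, componentValues map[string]map[string]any) bool {
	var version string
	var haveDRA bool
	for _, ref := range enabledRefs {
		switch ref.Name {
		case gpuOperatorName:
			version = ref.Version
		case draDriverName:
			haveDRA = true
		}
	}
	// Gate: both components must be enabled, and gpu-operator must carry a
	// version. Otherwise leave componentValues untouched and stay silent.
	if version == "" || !haveDRA {
		return false
	}

	values := componentValues[draDriverName]
	if values == nil {
		values = map[string]any{}
		componentValues[draDriverName] = values
	}
	for _, section := range []string{"controller", "kubeletPlugin"} {
		annotations := childMap(childMap(values, section), "podAnnotations")
		// Runs after user overrides, so this value wins over any earlier one.
		annotations[annotationKey] = version
	}
	return true
}

func main() {
	refs := []ComponentRef{
		{Name: gpuOperatorName, Version: "v26.3.1"},
		{Name: draDriverName, Version: "v25.12.0"},
	}
	values := map[string]map[string]any{}
	fmt.Println("injected:", injectDRARolloutAnnotation(refs, values))
	fmt.Println(values[draDriverName]["kubeletPlugin"])
}
```

Because the helper overwrites the key unconditionally, calling it after `extractComponentValues` means a hand-edited or user-supplied value cannot shadow the derived one. Whether that is the desired policy is the separate explicit knob mentioned above.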

Why this is the right shape:
- **Works for every deployer.** All deployers consume `componentValues` from the same code path, so all of them get the pod-template change.
- **Single source of truth.** The annotation is derived from the resolved recipe (which is exactly the source #966 is converging on for chart versions), so there's no separate string to maintain.
- **Bisects cleanly.** A maintainer who only edits `gpu-operator`'s pin gets the DRA annotation update for free in the bundle output.

## Near-term guardrail (safe to ship while the bundler-derived fix is in flight)

A static coherence check, ideally wired into `make qualify` or a new `make check-dra-rollout-trigger`:

- Compare the **resolved recipe's** `gpu-operator` `ComponentRef.Version` against the annotation value in `recipes/components/nvidia-dra-driver-gpu/values.yaml`. Fail if they differ.
- **Compare against the resolved recipe, not against `recipes/components/gpu-operator/values.yaml` or `helm.defaultVersion` in the registry.** This aligns with #966 (resolved recipe as single source of truth for chart versions) and avoids creating a new drift point.

This is the cheapest guardrail and is worth shipping immediately if the bundler-derived fix isn't ready for the same release.
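
The comparison itself is small. The sketch below assumes the DRA `values.yaml` has already been parsed into a `map[string]any` (by whatever YAML library the repo uses) and that the resolved gpu-operator version is passed in as a string. `checkRolloutTrigger` and `annotationAt` are hypothetical names, not existing AICR functions.

```go
package main

import (
	"errors"
	"fmt"
)

const annotationKey = "aicr.nvidia.com/gpu-operator-chart-version"

// annotationAt reads values[section].podAnnotations[annotationKey] and
// reports whether a string value was found there.
func annotationAt(values map[string]any, section string) (string, bool) {
	sec, ok := values[section].(map[string]any)
	if !ok {
		return "", false
	}
	annotations, ok := sec["podAnnotations"].(map[string]any)
	if !ok {
		return "", false
	}
	v, ok := annotations[annotationKey].(string)
	return v, ok
}

// checkRolloutTrigger (hypothetical name) fails when the kubeletPlugin
// annotation is missing or differs from the resolved gpu-operator version.
func checkRolloutTrigger(resolvedVersion string, draValues map[string]any) error {
	got, ok := annotationAt(draValues, "kubeletPlugin")
	if !ok {
		return errors.New("kubeletPlugin.podAnnotations is missing " + annotationKey)
	}
	if got != resolvedVersion {
		return fmt.Errorf("DRA rollout annotation %q does not match resolved gpu-operator version %q",
			got, resolvedVersion)
	}
	return nil
}

func main() {
	// Parsed form of the values file shown earlier in this issue.
	draValues := map[string]any{
		"kubeletPlugin": map[string]any{
			"podAnnotations": map[string]any{annotationKey: "v26.3.1"},
		},
	}
	// Simulates a gpu-operator bump that forgot the annotation.
	if err := checkRolloutTrigger("v26.4.0", draValues); err != nil {
		fmt.Println("check failed:", err)
	}
}
```

A `make` target would only need to resolve the recipe, parse the values file, call a check like this, and exit non-zero on error.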

## Definition of done

- [ ] **Primary**: `DefaultBundler.Make` injects `aicr.nvidia.com/gpu-operator-chart-version` (or equivalent annotation) into the rendered `nvidia-dra-driver-gpu` `componentValues` for both `controller` and `kubeletPlugin`, sourced from the resolved gpu-operator `ComponentRef.Version`, gated on both refs being enabled in the filtered recipe, injected after `extractComponentValues` and before `buildDeployer`. Maintainer-facing values file no longer carries the chart-version string.
- [ ] **Acceptance test (generated-artifact parity)**: bump the gpu-operator version in a fixture; assert that the rendered DRA `podAnnotations` change in at least the **default Helm-deployer** output AND one GitOps-deployer output (helmfile, flux, or argocd). This catches the cross-deployer parity risk.
- [ ] **Acceptance test (disabled-component negative case)**: with `nvidia-dra-driver-gpu` *disabled* in the recipe, assert the helper does not inject the annotation anywhere AND does not emit a warning. Proves the gating is correct.
- [ ] **Near-term guardrail (optional, ship if bundler change slips)**: `make` target that fails when the resolved gpu-operator chart version diverges from the DRA annotation. Compares against resolved recipe (not registry default).
- [ ] **Cluster verification**: on a cluster running an older gpu-operator, `aicr bundle` + deploy via the chosen deployer rolls the DRA pods automatically, and `aicr validate --phase performance` passes without manual intervention.

## Out of scope

- Bumping `nvidia-dra-driver-gpu` itself (still v25.12.0; not driver-version-bound in its registry pin).
- Coverage for users running the upstream `k8s-dra-driver-gpu` Helm chart directly without AICR artifacts — only an upstream chart change can cover that population. Worth a separate upstream conversation, not AICR-tracked.

## Related

- #894 — gpu-operator chart bump that surfaced this class of bug
- #965 — PR that ships the v26.3.1 bump AND the (manual-coupling) annotation mitigation
- #966 — separate refactor (resolved recipe as single source of truth for chart versions); the bundler-derived annotation slots into that work cleanly

---

**Credit:** v2 reframing (issue is a durability question, not the original bug) from Codex review of v1. v3 refinements (deploy.sh hook already present, bundler-derived annotation as the right primary, resolved-recipe basis for the coherence check, generated-artifact tests, upstream-chart-direct caveat) from Codex review of v2. v4 refinements (filtered-recipe gating, exact injection point between `extractComponentValues` and `buildDeployer`, "default Helm deployer" terminology, disabled-component negative test) from Codex review of v3.

