Our current version of gpu-operator defaults to templating in non-existent images when using older GPUs on Ubuntu 24.04:
Warning Failed 11m (x5 over 14m) kubelet Failed to pull image "nvcr.io/nvidia/driver:550.127.05-ubuntu24.04": rpc error: code = NotFound desc = failed to pull and unpack image "nvcr.io/nvidia/driver:550.127.05-ubuntu24.04": failed to resolve reference "nvcr.io/nvidia/driver:550.127.05-ubuntu24.04": nvcr.io/nvidia/driver:550.127.05-ubuntu24.04: not found
This results in the state machine getting stuck:
gpu-feature-discovery-fl9rq 0/2 Init:0/3 0 14m
gpu-operator-788c6bf9fb-d2rjj 1/1 Running 0 15m
gpu-operator-node-feature-discovery-gc-7f6fbc9775-4xtw6 1/1 Running 0 15m
gpu-operator-node-feature-discovery-master-6ccd579c8c-8djhp 1/1 Running 0 15m
gpu-operator-node-feature-discovery-worker-kw8pm 1/1 Running 0 15m
nvidia-container-toolkit-daemonset-9ncqq 0/1 Init:0/1 0 14m
nvidia-dcgm-exporter-dtld7 0/1 Init:0/1 0 14m
nvidia-device-plugin-daemonset-96sfj 0/2 Init:0/2 0 14m
nvidia-driver-daemonset-gkn4s 0/1 ImagePullBackOff 0 15m
nvidia-operator-validator-545q8 0/1 Init:0/4 0 14m
Bumping to the latest version gets images that actually exist templated in. However, for k3s deployments we still need to make sure the CONTAINERD_ environment variables are set to the custom k3s paths (https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html)
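For reference, a minimal sketch of what the k3s overrides could look like as Helm values. The `toolkit.env` key and the `CONTAINERD_CONFIG` / `CONTAINERD_SOCKET` names follow the linked NVIDIA getting-started page as I understand it. The two paths are assumptions based on a default k3s install and should be checked against the linked docs and the actual nodes before merging.

```yaml
# Sketch of a gpu-operator values fragment for k3s (not verified against this chart version).
toolkit:
  env:
    # Assumed location of the containerd config managed by k3s.
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
    # Assumed location of the containerd socket exposed by k3s.
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock
```

If these are wrong for a given node, the toolkit daemonset would presumably write its runtime config to a path that k3s does not read, so it is worth confirming both paths on a node first.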