Pod workload deployment step is throwing "executable file not found" error for nvidia-smi #31

@aoyshi

I can follow every step in the README and get the expected output, until the very last step, which deploys the pod workload:

cat << EOF | kubectl --context=kind-${KIND_CLUSTER_NAME} apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

I get an error from the container (it never starts, so I cannot exec -it into it to investigate):

failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown.

I have the nvidia-device-plugin pods working correctly. Every step of the setup has worked so far until this very last one. Am I missing something?
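One thing that may be worth ruling out (an assumption, not a confirmed cause): ubuntu:22.04 does not ship nvidia-smi, so the binary has to be injected by the NVIDIA container runtime when the container starts. The device plugin below is installed with runtimeClassName=nvidia, while the test pod above sets no runtime class, so it relies on nvidia being the default containerd runtime on the worker node. A minimal sketch of the same pod that requests the runtime class explicitly; the pod name gpu-test-runtimeclass is hypothetical:

```yaml
# Sketch only: the same test pod, plus an explicit runtime class.
# Assumption: the "nvidia" RuntimeClass used by the device plugin pods is what
# injects nvidia-smi; this is not confirmed as the cause of the error.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-runtimeclass  # hypothetical name, avoids clashing with gpu-test
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia     # request the NVIDIA runtime instead of the node default
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/gpu: 1
```

If this variant starts and lists the GPU while the original pod still fails, that would point at the default runtime configuration on the node rather than at the device plugin.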

More info:

I used this command to spin up the nvidia device plugin pods:

helm upgrade -i \
  --kube-context=kind-${KIND_CLUSTER_NAME} \
  --namespace nvidia \
  --set gfd.enabled=true \
  --set runtimeClassName=nvidia \
  --set deviceListStrategy=volume-mounts \
  --set deviceDiscoveryStrategy=nvml \
  --create-namespace \
  nvidia-device-plugin nvdp/nvidia-device-plugin

Output:

$ kubectl --context=kind-${KIND_CLUSTER_NAME} get pod -n nvidia

NAME                                                              READY   STATUS    RESTARTS        AGE
nvidia-device-plugin-gpu-feature-discovery-r6t7s                  1/1     Running   0               5m44s
nvidia-device-plugin-lsqvc                                        1/1     Running   0               5m44s
nvidia-device-plugin-node-feature-discovery-master-77b96ddqcxlw   1/1     Running   0               5m44s
nvidia-device-plugin-node-feature-discovery-worker-5n8nc          1/1     Running   1 (5m23s ago)   5m44s

$ kubectl --context=kind-${KIND_CLUSTER_NAME} get nodes -o json | jq -r '.items[] | select(.metadata.name | test("-worker[0-9]*$")) | {name: .metadata.name, "nvidia.com/gpu": .status.allocatable["nvidia.com/gpu"]}'

{
  "name": "nvkind-vknmz-worker",
  "nvidia.com/gpu": "1"
}

Cluster GPUs: cluster was created using ./nvkind cluster create

$ ./nvkind cluster print-gpus

[
    {
        "node": "nvkind-vknmz-worker",
        "gpus": [
            {
                "Index": "0",
                "Name": "NVIDIA A100 80GB PCIe",
                "UUID": "GPU-a80141e3-fabe-57be-6d71-7c1b39e79553"
            }
        ]
    }
]

Error:

$ kubectl describe pod gpu-test

Name:             gpu-test
Namespace:        default
Priority:         0
Service Account:  default
Node:             nvkind-vknmz-worker
Start Time:       Fri, 28 Feb 2025 14:26:13 +0000
Labels:           <none>
Annotations:      <none>
Status:           Running
Containers:
  ctr:
    Container ID:  containerd://1f6e8d3be9c54fdcb182f57320e925bcb1d64e408ea04b17740022cac04d87c0
    Image:         ubuntu:22.04
    Image ID:      docker.io/library/ubuntu@sha256:ed1544e454989078f5dec1bfdabd8c5cc9c48e0705d07b678ab6ae3fb61952d2
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-smi
      -L
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "nvidia-smi": executable file not found in $PATH
      Exit Code:    128
      Started:      Thu, 01 Jan 1970 00:00:00 +0000
      Finished:     Fri, 28 Feb 2025 14:32:02 +0000
    Ready:          False
    Restart Count:  6
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-stvml (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  kube-api-access-stvml:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  8m6s                   default-scheduler  Successfully assigned default/gpu-test to nvkind-vknmz-worker
  Normal   Pulling    8m5s                   kubelet            Pulling image "ubuntu:22.04"
  Normal   Pulled     8m3s                   kubelet            Successfully pulled image "ubuntu:22.04" in 1.715s (1.715s including waiting). Image size: 29545350 bytes.
  Warning  Failed     5m7s (x6 over 8m3s)    kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "nvidia-smi": executable file not found in $PATH
  Warning  BackOff    2m54s (x25 over 8m1s)  kubelet            Back-off restarting failed container ctr in pod gpu-test_default(94b2cc75-8ef2-4df6-98d9-2f4015a0dd3e)
  Normal   Created    2m17s (x7 over 8m3s)   kubelet            Created container: ctr
  Normal   Pulled     2m17s (x6 over 8m2s)   kubelet            Container image "ubuntu:22.04" already present on machine
