Unable to find GPU in pod where GPU has been successfully allocated, and nvidia-smi displays the message "Device not found" after execution.

I used nvkind to build a k8s cluster with two worker nodes on a Linux server, following the example:
one worker node with 2 GPUs and the other with 3 GPUs.
Then I installed the gpu-operator, and everything looked fine.
After that, I created a pod with kubectl apply.
The pod requests one GPU, and its status was Running.
Finally, I checked the pod's logs to see whether everything was OK and found that nvidia-smi had printed "No devices were found".

1. **The k8s cluster info:**
nvkind cluster print-gpus
[
    {
        "node": "explicit-gpus-worker",
        "gpus": [
            {
                "Index": "0",
                "Name": "NVIDIA H20",
                "UUID": "GPU-3c10a6fd-a8c7-4e9a-0788-32b503a11514"
            },
            {
                "Index": "1",
                "Name": "NVIDIA H20",
                "UUID": "GPU-4fca9ef6-4ebf-f2ca-3d57-51b2dd23f146"
            }
        ]
    },
    {
        "node": "explicit-gpus-worker2",
        "gpus": [
            {
                "Index": "0",
                "Name": "NVIDIA H20",
                "UUID": "GPU-d7adcefb-078a-69d7-d3d1-3b853ce93225"
            },
            {
                "Index": "1",
                "Name": "NVIDIA H20",
                "UUID": "GPU-734f87a0-26b1-e9c7-501a-be3d8b85fb8e"
            },
            {
                "Index": "2",
                "Name": "NVIDIA H20",
                "UUID": "GPU-4917c52b-1af2-a05f-100f-d348f9cd7cf5"
            }
        ]
    }
]
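
As a sanity check on the output above, here is a minimal Python sketch that tallies GPUs per node and looks for repeated UUIDs. It is not part of nvkind; the helper names `gpus_per_node` and `duplicate_uuids` are made up for illustration, and `SAMPLE` is the JSON above pasted in compact form.

```python
import json
from collections import Counter

# The `nvkind cluster print-gpus` output from this report, compacted.
SAMPLE = """
[
  {"node": "explicit-gpus-worker", "gpus": [
    {"Index": "0", "Name": "NVIDIA H20", "UUID": "GPU-3c10a6fd-a8c7-4e9a-0788-32b503a11514"},
    {"Index": "1", "Name": "NVIDIA H20", "UUID": "GPU-4fca9ef6-4ebf-f2ca-3d57-51b2dd23f146"}]},
  {"node": "explicit-gpus-worker2", "gpus": [
    {"Index": "0", "Name": "NVIDIA H20", "UUID": "GPU-d7adcefb-078a-69d7-d3d1-3b853ce93225"},
    {"Index": "1", "Name": "NVIDIA H20", "UUID": "GPU-734f87a0-26b1-e9c7-501a-be3d8b85fb8e"},
    {"Index": "2", "Name": "NVIDIA H20", "UUID": "GPU-4917c52b-1af2-a05f-100f-d348f9cd7cf5"}]}
]
"""


def gpus_per_node(print_gpus_json: str) -> dict:
    """Map each node name to the number of GPUs listed for it."""
    nodes = json.loads(print_gpus_json)
    return {entry["node"]: len(entry["gpus"]) for entry in nodes}


def duplicate_uuids(print_gpus_json: str) -> list:
    """Return UUIDs that appear more than once across all nodes."""
    nodes = json.loads(print_gpus_json)
    counts = Counter(gpu["UUID"] for entry in nodes for gpu in entry["gpus"])
    return sorted(uuid for uuid, n in counts.items() if n > 1)


print(gpus_per_node(SAMPLE))
print(duplicate_uuids(SAMPLE))
```

For the output in this report, the tally is 2 GPUs on `explicit-gpus-worker` and 3 on `explicit-gpus-worker2`, with no UUID listed twice.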

2. **kubectl get pods:**
NAME        READY   STATUS    RESTARTS   AGE
gpu-pod     1/1     Running   0          45s


3. **kubectl logs gpu-pod:**
No devices were found

4. **The gpu-pod YAML file:**
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
  labels:
    app: gpu-workload
spec:
  nodeSelector:
    accelerator: nvidia-tesla-k160        # Select nodes with GPU (adjust label as needed)
  containers:
    - name: gpu-container
      image: nvidia/cuda:11.8.0-base-ubuntu22.04
      command: ["bash", "-c"]
      args: ["nvidia-smi; trap 'exit 0' TERM; sleep infinity & wait"]
      resources:
        requests:
          nvidia.com/gpu: "1"            # Request 1 GPU
        limits:
          nvidia.com/gpu: "1"            # Limit to 1 GPU
  restartPolicy: Never
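
To narrow down where the GPU goes missing, a few follow-up checks could be scripted. This is a hedged sketch, not a confirmed diagnosis: it assumes `kubectl` is on the PATH and points at this cluster, and that the pod is still running. The helper names `diagnostic_commands` and `run_all` are hypothetical. Whether `NVIDIA_VISIBLE_DEVICES` is set depends on how the device plugin is configured, so an empty value is a hint rather than proof.

```python
import shutil
import subprocess

POD = "gpu-pod"  # pod name from the manifest above


def diagnostic_commands(pod: str = POD) -> list:
    """Build the kubectl invocations as argv lists (nothing is executed here)."""
    return [
        # GPUs each node advertises as allocatable to the scheduler
        ["kubectl", "get", "nodes", "-o",
         r"custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"],
        # Which worker node the pod was scheduled on
        ["kubectl", "get", "pod", pod, "-o", "wide"],
        # Device selection passed to the container, if any (assumption: env-var strategy)
        ["kubectl", "exec", pod, "--", "printenv", "NVIDIA_VISIBLE_DEVICES"],
        # Device nodes actually visible inside the container
        ["kubectl", "exec", pod, "--", "sh", "-c", "ls -l /dev/nvidia*"],
    ]


def run_all(commands: list) -> None:
    """Run each command and print its stdout, or stderr when stdout is empty."""
    for argv in commands:
        print("$ " + " ".join(argv))
        result = subprocess.run(argv, capture_output=True, text=True)
        print(result.stdout or result.stderr)


if __name__ == "__main__":
    if shutil.which("kubectl"):
        run_all(diagnostic_commands())
    else:
        # Without kubectl, just show what would be run.
        for argv in diagnostic_commands():
            print(" ".join(argv))
```

Comparing the allocatable count per node with the device nodes visible inside the container should show whether the problem is on the scheduling side or in what the container actually receives.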

 
