Unable to find GPU in pod where GPU has been successfully allocated, and nvidia-smi displays the message "Device not found" after execution. #63

@winniew-wqiong

I used nvkind to build a k8s cluster with two worker nodes on a Linux server, following the example.
One worker node has 2 GPUs and the other has 3 GPUs.
Then I installed the gpu-operator, and everything looked fine.
After that, I applied a pod with the kubectl apply command.
The pod requests one GPU resource, and its status was Running.
Finally, I checked the pod's logs to see if everything was OK and found that nvidia-smi had printed "No devices were found".

  1. The k8s cluster info:
    nvkind cluster print-gpus
    [
        {
            "node": "explicit-gpus-worker",
            "gpus": [
                {
                    "Index": "0",
                    "Name": "NVIDIA H20",
                    "UUID": "GPU-3c10a6fd-a8c7-4e9a-0788-32b503a11514"
                },
                {
                    "Index": "1",
                    "Name": "NVIDIA H20",
                    "UUID": "GPU-4fca9ef6-4ebf-f2ca-3d57-51b2dd23f146"
                }
            ]
        },
        {
            "node": "explicit-gpus-worker2",
            "gpus": [
                {
                    "Index": "0",
                    "Name": "NVIDIA H20",
                    "UUID": "GPU-d7adcefb-078a-69d7-d3d1-3b853ce93225"
                },
                {
                    "Index": "1",
                    "Name": "NVIDIA H20",
                    "UUID": "GPU-734f87a0-26b1-e9c7-501a-be3d8b85fb8e"
                },
                {
                    "Index": "2",
                    "Name": "NVIDIA H20",
                    "UUID": "GPU-4917c52b-1af2-a05f-100f-d348f9cd7cf5"
                }
            ]
        }
    ]

  2. kubectl get pods
    NAME      READY   STATUS    RESTARTS   AGE
    gpu-pod   1/1     Running   0          45s

  3. kubectl logs gpu-pod
    No devices were found

  4. gpu-pod YAML file:
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod
      labels:
        app: gpu-workload
    spec:
      nodeSelector:
        accelerator: nvidia-tesla-k160 # Select nodes with GPU (adjust label as needed)
      containers:
      - name: gpu-container
        image: nvidia/cuda:11.8.0-base-ubuntu22.04
        command: ["bash", "-c"]
        args: ["nvidia-smi; trap 'exit 0' TERM; sleep infinity & wait"]
        resources:
          requests:
            nvidia.com/gpu: "1" # Request 1 GPU
          limits:
            nvidia.com/gpu: "1" # Limit to 1 GPU
      restartPolicy: Never
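
One possible cross-check for a report like this is to compare the GPUs that nvkind says it injected into each worker (item 1) with the `nvidia.com/gpu` count each node advertises to Kubernetes. The Python sketch below is only an illustration: the helper names (`gpus_per_node`, `allocatable_gpus`, `mismatches`, `main`) are hypothetical, and it assumes that `nvkind cluster print-gpus` writes only the JSON shown in item 1 to stdout and that the nvkind node names match the Kubernetes node names.

```python
import json
import subprocess


def gpus_per_node(print_gpus_output: str) -> dict:
    """Count GPUs per node in JSON shaped like the `nvkind cluster print-gpus` output."""
    return {entry["node"]: len(entry["gpus"]) for entry in json.loads(print_gpus_output)}


def allocatable_gpus(node_list: dict) -> dict:
    """Read allocatable nvidia.com/gpu per node from a parsed `kubectl get nodes -o json`."""
    result = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Nodes that advertise no GPU resource are counted as 0.
        result[node["metadata"]["name"]] = int(allocatable.get("nvidia.com/gpu", "0"))
    return result


def mismatches(injected: dict, advertised: dict) -> dict:
    """Return {node: (injected, advertised)} for nodes where the two counts differ."""
    return {
        name: (count, advertised.get(name, 0))
        for name, count in injected.items()
        if advertised.get(name, 0) != count
    }


def main() -> None:
    # Assumption: both CLIs are on PATH and target the cluster from this report.
    nvkind_out = subprocess.run(
        ["nvkind", "cluster", "print-gpus"],
        capture_output=True, text=True, check=True,
    ).stdout
    kubectl_out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    diff = mismatches(gpus_per_node(nvkind_out), allocatable_gpus(json.loads(kubectl_out)))
    print(diff if diff else "injected and advertised GPU counts agree")
```

Calling `main()` on the host that runs the cluster prints any node whose advertised count differs from what nvkind reports. Matching counts would not by themselves explain the `nvidia-smi` output, but a mismatch would narrow down where to look.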
