I used nvkind to build a k8s cluster with two worker nodes on a Linux server, following the example.
One worker node has 2 GPUs and the other has 3 GPUs.
Then I installed the gpu-operator and everything looked fine.
After that, I created a pod with kubectl apply.
The pod requests one GPU, and its status was Running.
Finally, I checked the pod's logs to see whether everything was OK and found that nvidia-smi had printed "No devices were found".
-
The k8s cluster info:
nvkind cluster print-gpus
[
  {
    "node": "explicit-gpus-worker",
    "gpus": [
      {
        "Index": "0",
        "Name": "NVIDIA H20",
        "UUID": "GPU-3c10a6fd-a8c7-4e9a-0788-32b503a11514"
      },
      {
        "Index": "1",
        "Name": "NVIDIA H20",
        "UUID": "GPU-4fca9ef6-4ebf-f2ca-3d57-51b2dd23f146"
      }
    ]
  },
  {
    "node": "explicit-gpus-worker2",
    "gpus": [
      {
        "Index": "0",
        "Name": "NVIDIA H20",
        "UUID": "GPU-d7adcefb-078a-69d7-d3d1-3b853ce93225"
      },
      {
        "Index": "1",
        "Name": "NVIDIA H20",
        "UUID": "GPU-734f87a0-26b1-e9c7-501a-be3d8b85fb8e"
      },
      {
        "Index": "2",
        "Name": "NVIDIA H20",
        "UUID": "GPU-4917c52b-1af2-a05f-100f-d348f9cd7cf5"
      }
    ]
  }
]
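For a quick cross-check of the per-node GPU counts, here is a minimal Python sketch that parses the `nvkind cluster print-gpus` output shown above. The helper name `gpus_per_node` is hypothetical (it is not part of nvkind), and the embedded JSON is just the output from this report with the `Name` fields left out for brevity.

```python
import json

# Output of `nvkind cluster print-gpus` from this report
# ("Name" fields omitted for brevity; indexes and UUIDs are as posted).
PRINT_GPUS_OUTPUT = """
[
  {"node": "explicit-gpus-worker",
   "gpus": [
     {"Index": "0", "UUID": "GPU-3c10a6fd-a8c7-4e9a-0788-32b503a11514"},
     {"Index": "1", "UUID": "GPU-4fca9ef6-4ebf-f2ca-3d57-51b2dd23f146"}
   ]},
  {"node": "explicit-gpus-worker2",
   "gpus": [
     {"Index": "0", "UUID": "GPU-d7adcefb-078a-69d7-d3d1-3b853ce93225"},
     {"Index": "1", "UUID": "GPU-734f87a0-26b1-e9c7-501a-be3d8b85fb8e"},
     {"Index": "2", "UUID": "GPU-4917c52b-1af2-a05f-100f-d348f9cd7cf5"}
   ]}
]
"""


def gpus_per_node(raw: str) -> dict[str, int]:
    """Hypothetical helper: map each node name to the number of GPUs listed for it."""
    return {entry["node"]: len(entry["gpus"]) for entry in json.loads(raw)}


counts = gpus_per_node(PRINT_GPUS_OUTPUT)
for node, n in counts.items():
    print(f"{node}: {n} GPU(s)")
```

This only confirms what nvkind reports for each worker (2 GPUs on one node, 3 on the other). It does not check what the device plugin advertises to Kubernetes or what is visible inside the pod.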
-
kubectl get pods
NAME READY STATUS RESTARTS AGE
gpu-pod 1/1 Running 0 45s
-
kubectl logs gpu-pod
No devices were found
The gpu-pod YAML file:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
  labels:
    app: gpu-workload
spec:
  nodeSelector:
    accelerator: nvidia-tesla-k160 # Select nodes with GPU (adjust label as needed)
  containers:
    - name: gpu-container
      image: nvidia/cuda:11.8.0-base-ubuntu22.04
      command: ["bash", "-c"]
      args: ["nvidia-smi; trap 'exit 0' TERM; sleep infinity & wait"]
      resources:
        requests:
          nvidia.com/gpu: "1" # Request 1 GPU
        limits:
          nvidia.com/gpu: "1" # Limit to 1 GPU
  restartPolicy: Never