Add gpu-e2e workflow #69
Merged
Force-pushed from 7a38e5b to c9e4224
Force-pushed from 47e81ee to b6e0c2b
ArangoGutierrez
previously approved these changes
Apr 20, 2026
cdesiniotis
reviewed
Apr 20, 2026
cdesiniotis
left a comment
Contributor
Thanks @dims! This is great. I left some questions for my understanding.
Force-pushed from 2f3a756 to b990da4
Add a label-gated GPU e2e workflow for nvkind that runs on both
NVIDIA self-hosted runner pools (linux-amd64-gpu-t4-latest-1 and
linux-arm64-gpu-l4-latest-1).
Three scenarios per matrix job:
S1 default cluster lifecycle — nvkind cluster create, node count,
RuntimeClass presence, and a set-equality check between the
UUIDs reported by `nvkind cluster print-gpus` (JSON) and
`nvidia-smi --query-gpu=uuid` on the host.
S2 GPU Operator (minimal mode) + nvidia-smi pod — installs
`nvidia/gpu-operator` pinned to v26.3.1 with driver/toolkit/
DCGM disabled and NFD enabled (matches aicr's proven stack),
waits for the nvidia-device-plugin daemonset rollout, confirms
`nvidia.com/gpu` capacity is advertised, and runs a pod that
execs `nvidia-smi`.
S3 DRA driver + ResourceClaim + nvidia-smi pod — amd64 only.
Installs `nvidia/nvidia-dra-driver-gpu` v25.12.0 into a cluster
configured via hack/ci/templates/dra.yaml.tmpl
(DynamicResourceAllocation feature gate across control-plane
and kubelet; enable_cdi in containerd), runs a pod backed by a
ResourceClaimTemplate, asserts the pod sees exactly one GPU
from `nvidia-smi -L`, and asserts DRA actually engaged via
`pod.status.resourceClaimStatuses[name=="gpu"].resourceClaimName`
(the claim itself is pod-scoped and GC'd after Succeeded, so
checking the status is more reliable than querying the claim).
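Two of the assertions described above, the S1 UUID set-equality check and the S3 check that DRA engaged, can be sketched in a few lines of Python. This is an illustrative sketch, not the workflow's actual script. The function names are hypothetical. The JSON layout printed by `nvkind cluster print-gpus` is not shown here, so the sketch avoids assuming a schema and collects every string value that starts with `GPU-`. It also assumes the host side runs `nvidia-smi --query-gpu=uuid --format=csv,noheader`, which prints one UUID per line.

```python
import json


def collect_gpu_uuids(node):
    """Recursively collect every string value that looks like a GPU UUID.

    Assumption: UUIDs appear somewhere in the JSON as strings starting with
    "GPU-"; the exact schema of `nvkind cluster print-gpus` is not assumed.
    """
    found = set()
    if isinstance(node, dict):
        for value in node.values():
            found |= collect_gpu_uuids(value)
    elif isinstance(node, list):
        for item in node:
            found |= collect_gpu_uuids(item)
    elif isinstance(node, str) and node.startswith("GPU-"):
        found.add(node)
    return found


def host_gpu_uuids(nvidia_smi_output):
    """Parse one-UUID-per-line output into a set, skipping blank lines."""
    return {line.strip() for line in nvidia_smi_output.splitlines() if line.strip()}


def uuid_sets_match(print_gpus_json, nvidia_smi_output):
    """S1: the cluster and the host must report exactly the same UUID set."""
    cluster = collect_gpu_uuids(json.loads(print_gpus_json))
    return cluster == host_gpu_uuids(nvidia_smi_output)


def dra_claim_name(pod, claim_ref="gpu"):
    """S3: return resourceClaimName for the named entry in
    pod.status.resourceClaimStatuses, or None if DRA did not engage.

    `pod` is the dict parsed from `kubectl get pod <name> -o json`.
    """
    statuses = pod.get("status", {}).get("resourceClaimStatuses") or []
    for status in statuses:
        if status.get("name") == claim_ref:
            return status.get("resourceClaimName")
    return None
```

Comparing sets rather than lists keeps the S1 check independent of the order in which either tool reports GPUs. Reading the claim name from the pod status works even after the pod-scoped claim has been garbage-collected.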
Triggers:
- push: main, pull-request/<N> (copy-pr-bot mirror), gpu-ci/**
- pull_request [labeled] with `run-gpu-tests`
- schedule: 06:00 UTC daily
- workflow_dispatch
A `check-paths` gate protects GPU runner minutes by only running when
workflow/CI/source paths change in the PR diff against main.
Artifacts collected on every run: kind export logs, per-cluster pod
list + events, docker daemon.json, nvidia-ctk config. Retained 7 days.
Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>
Force-pushed from b990da4 to 5741e80
cdesiniotis
approved these changes
Apr 20, 2026
First slice of GPU end-to-end CI for nvkind. One workflow, two matrix jobs on NVIDIA self-hosted runners:
- `linux-amd64-gpu-t4-latest-1`
- `linux-arm64-gpu-l4-latest-1`
Three scenarios per matrix job:
- S1 (default cluster lifecycle): `nvkind cluster create`, node count, RuntimeClass presence, `nvkind cluster print-gpus`.
- S2 (GPU Operator): installs `nvidia/gpu-operator` with driver/toolkit/DCGM disabled and NFD enabled, waits for the device-plugin daemonset rollout, confirms `nvidia.com/gpu` capacity is advertised, runs a pod that execs `nvidia-smi`.
- S3 (DRA driver): installs `nvidia/nvidia-dra-driver-gpu` v25.12.0 into a cluster configured via `hack/ci/templates/dra.yaml.tmpl` (DynamicResourceAllocation feature gate on control-plane + kubelet, `enable_cdi` in containerd), applies a `ResourceClaimTemplate`, runs a pod that execs `nvidia-smi` via the claim.
Triggers:
- push: `main`, `pull-request/<N>` (copy-pr-bot mirror), `gpu-ci/**`
- pull_request `[labeled]` with `run-gpu-tests`
A `check-paths` job gates the GPU runner minutes: the GPU jobs run only when workflow, CI, or source paths change in the PR diff against `main`.
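The path gate can be illustrated with a small Python sketch. It is not the workflow's implementation: the pattern list and the function name are hypothetical, since the actual paths watched by `check-paths` are not listed here. It assumes the changed files come from something like `git diff --name-only` against `main`.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns standing in for "workflow/CI/source paths".
# Note that fnmatch's "*" also matches "/" characters, so "hack/ci/*"
# covers nested files and "*.go" covers Go files at any depth.
GPU_RELEVANT_PATTERNS = [
    ".github/workflows/*",
    "hack/ci/*",
    "*.go",
    "go.mod",
    "go.sum",
]


def should_run_gpu_jobs(changed_files, patterns=GPU_RELEVANT_PATTERNS):
    """Return True if any changed file matches any GPU-relevant pattern."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in patterns
    )
```

A docs-only change such as `["README.md"]` returns False and would skip the GPU jobs, while a diff touching `hack/ci/templates/dra.yaml.tmpl` returns True.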