
fix(gpu): add Tegra/Jetson GPU support #625

Open
elezar wants to merge 6 commits into main from fix/tegra-gpu-support

Conversation

@elezar
Member

@elezar elezar commented Mar 26, 2026

Summary

Adds GPU support for NVIDIA Tegra/Jetson platforms by bind-mounting the
host-files configuration directory, updating the device plugin image, and
preserving CDI-injected GIDs across privilege drop.

Related Issue

Part of #398 (CDI injection). Depends on #568 (Tegra system support). Should be merged after #495 and #503.

Upstream PRs:

Changes

  • Bind-mount /etc/nvidia-container-runtime/host-files-for-container.d (read-only) into the gateway container when present, so the nvidia runtime inside k3s applies the same host-file injection config as the host — required for Jetson/Tegra CDI spec generation
  • Pin k8s-device-plugin to an image that supports host-files bind-mounts and generates additionalGids in the CDI spec (GID 44 / video, required for /dev/nvmap access on Tegra)
  • Preserve CDI-injected supplemental GIDs across initgroups() during privilege drop, so exec'd processes retain access to GPU devices
  • Fall back to /usr/sbin/nvidia-smi in the GPU e2e test for Tegra systems where nvidia-smi is not on the default PATH

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

elezar added 4 commits March 26, 2026 14:44
Bind-mount /etc/nvidia-container-runtime/host-files-for-container.d
(read-only) into the gateway container when it exists, so the nvidia
runtime running inside k3s can apply the same host-file injection
config as on the host — required for Jetson/Tegra platforms.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Use ghcr.io/nvidia/k8s-device-plugin:2ab68c16 which includes support for
mounting /etc/nvidia-container-runtime/host-files-for-container.d into the
device plugin pod, required for correct CDI spec generation on Tegra-based
systems.

Also included is an nvcdi API bump that ensures that additional GIDs are
included in the generated CDI spec.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
initgroups(3) replaces all supplemental groups with the user's entries
from /etc/group, discarding GIDs injected by the container runtime via
CDI (e.g. GID 44/video needed for /dev/nvmap on Tegra). Snapshot the
container-level GIDs before initgroups runs and merge them back
afterwards, excluding GID 0 (root) to avoid privilege retention.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
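The merge step from this commit can be sketched as a pure function. This is a sketch of the described logic only (the function name is hypothetical, and the real code performs the actual `setgroups` call, which requires privilege):

```python
def merged_groups(pre_initgroups_gids: list[int],
                  post_initgroups_gids: list[int]) -> list[int]:
    """Merge CDI-injected GIDs (snapshotted before initgroups) back into
    the post-initgroups supplemental group list, excluding GID 0 so root
    group membership is not retained across the privilege drop."""
    merged = set(post_initgroups_gids)
    merged.update(g for g in pre_initgroups_gids if g != 0)
    return sorted(merged)
```

In the real drop path the equivalent of `os.getgroups()` is captured before `initgroups()` runs, and the merged result is applied with the equivalent of `os.setgroups()` afterwards.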
On Jetson/Tegra platforms nvidia-smi is installed at /usr/sbin/nvidia-smi
rather than /usr/bin/nvidia-smi and may not be on PATH inside the sandbox.
Fall back to the full path when the bare command is not found.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
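The fallback described here amounts to a one-line lookup; a Python rendering (the helper name is illustrative, the test itself is not necessarily written in Python):

```python
import shutil


def resolve_nvidia_smi() -> str:
    """Return nvidia-smi from PATH, falling back to the Tegra install
    location /usr/sbin/nvidia-smi when the bare command is not found."""
    return shutil.which("nvidia-smi") or "/usr/sbin/nvidia-smi"
```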
@elezar elezar self-assigned this Mar 26, 2026
@elezar
Member Author

elezar commented Mar 26, 2026

cc @johnnynunez

@johnnynunez

johnnynunez commented Mar 26, 2026

LGTM @elezar
ready to merge @johntmyers

@elezar
Member Author

elezar commented Mar 27, 2026

This was only tested in conjunction with #495 and #503. Once those are in, there should be no reason not to get this in too.

@elezar elezar marked this pull request as ready for review March 27, 2026 07:16
@elezar elezar requested a review from a team as a code owner March 27, 2026 07:16
@johnnynunez

Yes, I know. I was tracking it and tested it.

@pimlock
Collaborator

pimlock commented Apr 1, 2026

I dug into the GID-preservation change here and I think PR #710 may make it unnecessary.

What I verified locally:

  • Inside a running sandbox, the GPU device nodes are owned by sandbox:sandbox after supervisor setup.
  • The corresponding host and k3s-container device nodes remain root:root with mode 666, so the sandbox-side chown() does not appear to mutate the host devices.
  • That suggests these are container-local CDI-created device nodes, not direct host bind mounts.

If that holds generally, then once #710 adds the needed GPU device paths to filesystem.read_write, prepare_filesystem() will chown(path, uid, gid) before privilege drop and DAC access should come from ownership rather than from preserving CDI-injected supplemental groups.

So I think we should re-check whether the drop_privileges() GID merge is still needed after #710 lands. It may be removable if all required GPU paths (including Tegra-specific ones like /dev/nvmap if applicable) are present and successfully chowned.

@pimlock
Collaborator

pimlock commented Apr 1, 2026

Follow-up: I removed the checked-in custom ghcr.io/nvidia/k8s-device-plugin:2ab68c16 image override from this branch.

If someone still needs that image on a live gateway for testing, they can patch the running cluster in place:

openshell doctor exec -- kubectl -n kube-system patch helmchart nvidia-device-plugin --type merge -p '{
  "spec": {
    "valuesContent": "image:\n  repository: ghcr.io/nvidia/k8s-device-plugin\n  tag: \"2ab68c16\"\nruntimeClassName: nvidia\ndeviceListStrategy: cdi-cri\ndeviceIDStrategy: index\ncdi:\n  nvidiaHookPath: /usr/bin/nvidia-cdi-hook\nnvidiaDriverRoot: \"/\"\ngfd:\n  enabled: false\nnfd:\n  enabled: false\naffinity: null\n"
  }
}'
openshell doctor exec -- kubectl -n nvidia-device-plugin rollout status ds/nvidia-device-plugin
openshell doctor exec -- kubectl -n nvidia-device-plugin get ds nvidia-device-plugin -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

That only affects the running gateway. Recreating the gateway reapplies the checked-in manifest.

@pimlock
Collaborator

pimlock commented Apr 1, 2026

Once #710 is reviewed and merged, I will add it here and test again. I'm getting a lease on colossus for a Jetson-based system.

It's very likely some policy updates will be required: #677 is now merged, and before it, Landlock policies were not correctly applied in many contexts.
