feat(agent): update Dockerfile for NVIDIA agent (#2002) #2003

Open
DQ-Kwon wants to merge 3 commits into henrygd:main from DQ-Kwon:feature/lightweight-nvidia-agent

@DQ-Kwon DQ-Kwon commented May 14, 2026

📃 Description

Implements feature #2002: optimizes the agent-nvidia image for size and multi-arch support. Switching to a Distroless base and mounting nvidia-smi from the host reduces the image size by 75%.

🪵 Changelog

➕ Added

  • Multi-arch support: amd64, arm64, arm/v7.
  • Dynamic library tracking for smartctl on Distroless.

✏️ Changed

  • Base image: ubuntu-cuda → distroless/base-debian12.
  • Reduced uncompressed size by ~270MB.

🗑️ Removed

  • Bundled CUDA layers and shell (via Distroless).

@DQ-Kwon DQ-Kwon requested a review from henrygd as a code owner May 14, 2026 13:48
@DQ-Kwon DQ-Kwon force-pushed the feature/lightweight-nvidia-agent branch from 235bd80 to 264dac8 Compare May 14, 2026 14:57
@svenvg93
Collaborator

This will be a major breaking change, right? Users would need the SMI tool on the host, and it would have to be added to their docker compose files for this to work?

@DQ-Kwon
Author

DQ-Kwon commented May 14, 2026

Hi @svenvg93, thanks for the feedback. I agree this is a major change, and I see your point.

In the NVIDIA ecosystem, having nvidia-smi on the host is practically a standard requirement. Much like mounting docker.sock, I believe mounting the SMI tool is a more lightweight and practical approach for monitoring.

As for the base image, I chose Debian over Alpine because NVIDIA's official binaries are built for glibc. In my experience, Debian is far more stable than Alpine’s musl environment for these GPU tasks.

Regarding the changes to the compose file, I’m attaching the updated compose.yml below for reference:

services:
  beszel-agent:
    image: henrygd/beszel-agent-nvidia:slim
    container_name: beszel-agent
    restart: unless-stopped
    network_mode: host
    gpus: all

    volumes:
      - ./beszel_agent_data:/var/lib/beszel-agent
      - /var/run/docker.sock:/var/run/docker.sock:ro
      # If using WSL, the path might be: /usr/lib/wsl/lib/nvidia-smi
      - /usr/bin/nvidia-smi:/usr/bin/nvidia-smi:ro

    environment:
      LISTEN: 45876
      KEY: "<public key>"
      HUB_URL: "<hub url>"
      TOKEN: "<token>"

      GPU_COLLECTOR: nvidia-smi
      NVIDIA_VISIBLE_DEVICES: all
      NVIDIA_DRIVER_CAPABILITIES: compute,utility

Note: Technically, the gpus: all option should allow the NVIDIA Container Toolkit to handle library injections automatically. However, I’ve included the explicit mount for nvidia-smi to ensure availability across different environments. In my tests, it worked even when commented out, but I've kept it as a safeguard for broader compatibility.
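For Compose releases that do not recognize the service-level `gpus` key, the same GPU request is usually written with the longer `deploy.resources.reservations.devices` form. A minimal sketch, assuming the NVIDIA Container Toolkit is installed on the host. Only the GPU-related keys are shown, and the rest of the service would stay as in the compose file above:

```yaml
services:
  beszel-agent:
    image: henrygd/beszel-agent-nvidia:slim
    # Long-form equivalent of `gpus: all`
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Either form only requests the GPUs. Whether nvidia-smi ends up inside the container still depends on the toolkit injection or the explicit mount discussed in the note.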

@svenvg93
Collaborator

Hi @DQ-Kwon,

I love the work you did on this. I think that as long as it is published under a different tag, there won't be any impact on existing users. Let's see what Henry thinks about this :)

@DQ-Kwon
Author

DQ-Kwon commented May 15, 2026

I agree with your point. I’ve reverted the beszel-agent-nvidia image and separated it into a new tag called beszel-agent-nvidia:slim. This should ensure that existing users won't face any issues.

@svenvg93
Collaborator

Hi! #2016 prompted me to look into whether we could ship a single image with all the monitoring in one place. Technically, with your image the Intel GPU would also work if intel_gpu_top is mounted into the container.

I wonder what @henrygd thinks about this, as it would technically allow one image instead of 4 different ones.

@DQ-Kwon
Author

DQ-Kwon commented May 18, 2026

Hi @svenvg93,

Thank you for the suggestion! I've spent some time reviewing the feasibility of an All-in-one image, and it's a very interesting concept.

However, due to architectural differences between GPU vendors, implementing this may be more challenging than it initially appears.

The current implementation relies heavily on the NVIDIA Container Toolkit. When a container is started with NVIDIA resources, the toolkit automatically exposes the /dev/nvidia* devices, the nvidia-smi binary, and the required shared libraries from the host into the container. This allows us to keep the image extremely lightweight (Distroless) while maintaining good driver compatibility.

For Intel and AMD GPUs, the situation is a bit different. While hardware access can generally be provided via --device /dev/dri, vendor-specific monitoring tools such as intel_gpu_top or radeontop — along with their dependent shared libraries — are not automatically available inside the container in the same way.

To support an All-in-one image under these constraints, we would likely face a significant trade-off:

  1. Fat Image Approach: Pre-installing vendor-specific tools for Intel, AMD, and NVIDIA inside the image. This would substantially increase the image size, which goes against the lightweight refactoring achieved in this PR.

  2. Manual Mounting Approach: Requiring users to manually mount host binaries and shared library paths into the container. This would make the docker-compose.yml considerably more complex and negatively impact the user experience.
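To make the second option concrete, here is a rough sketch of what manual mounting might look like for an Intel GPU. Everything beyond `/dev/dri` and the tool name is an assumption: the binary location, the library paths (placeholders, to be discovered with `ldd` on the host), and whether the agent would pick up the tool without further configuration or extra container capabilities have not been verified.

```yaml
services:
  beszel-agent:
    image: henrygd/beszel-agent-nvidia:slim
    devices:
      # Compose equivalent of `--device /dev/dri`
      - /dev/dri:/dev/dri
    volumes:
      # Assumed host location of the tool; varies by distribution
      - /usr/bin/intel_gpu_top:/usr/bin/intel_gpu_top:ro
      # Hypothetical placeholder: one entry per shared library that
      # `ldd /usr/bin/intel_gpu_top` reports on the host
      - /path/to/host/libexample.so:/path/in/container/libexample.so:ro
```

Each additional vendor would add its own device, binary, and library entries, which is where the extra complexity in the compose file comes from.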

Supporting all vendors cleanly within a single image would also require broader testing and long-term maintenance across multiple GPU ecosystems and host distributions.

I'm not deeply familiar with every non-NVIDIA GPU/container ecosystem, so there may be gaps or outdated assumptions in my review. I would definitely appreciate input from contributors with more experience in Intel or AMD GPU environments.

For the short term, I think keeping device-specific images is the more practical approach to preserve simplicity and optimization. That said, I agree the idea is valuable, and it may be worth revisiting later as a broader long-term improvement.

What are your thoughts on this?

@svenvg93
Collaborator

Hi @DQ-Kwon,

Thanks for checking!
I was already afraid of that 😅. I think one image might be easier from an end-user perspective, even if it is large.
A smaller image like the one you are proposing also has its benefits. Neither option is a clear winner. I will check whether there are other options for GPU monitoring that would make it leaner.

I think the main question is what @henrygd thinks of it in terms of long-term support.
