Support Read-Only, systemd-less Systems #382

Open
Arc676 wants to merge 2 commits into NVIDIA:main from Arc676:talos

Conversation

Arc676 commented May 19, 2026

Motivation

Informally, this PR adds (partial) support for Talos. Closes #356.

More formally, this PR adds support for systems with read-only filesystems and systems that do not run systemd.

Description

The MIG manager assumes that it will be able to copy the mig-parted binary to the host and use systemd to restart host-side GPU services. Neither of these is true for Talos, which is an immutable OS that doesn't run systemd. Proper Talos support would introduce a dependency on the Talos API, but that is beyond the scope of this PR and likely falls beyond the scope of what this tool should support.

This PR adds support for systems like Talos by introducing two new flags (both of which are required for Talos):

  1. Identifying the host as read-only: prevent the manager from attempting to copy data to the host
  2. Flagging the absence of systemd: tell the manager to skip all systemd operations that would otherwise cause the program to hang, since there would be no response on DBus
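The effect of the two flags can be sketched as follows. This is a minimal illustration rather than the PR's actual code: the `Options` struct, its field names, and the `planSteps` function are all hypothetical, and the step descriptions only paraphrase the behavior described above.

```go
package main

// Options holds the two hypothetical switches described above.
type Options struct {
	ReadOnlyHost       bool // host filesystem is read-only: never copy files to it
	SystemdUnavailable bool // no systemd on the host: never talk to it over DBus
}

// planSteps returns the steps a reconfiguration would attempt.
// The MIG step is always included; host-touching steps are gated by the flags.
func planSteps(o Options) []string {
	var steps []string
	if !o.ReadOnlyHost {
		steps = append(steps, "copy mig-parted binary to host")
	}
	if !o.SystemdUnavailable {
		steps = append(steps, "stop host GPU services via systemd")
	}
	steps = append(steps, "apply MIG configuration")
	if !o.SystemdUnavailable {
		steps = append(steps, "restart host GPU services via systemd")
	}
	return steps
}
```

With both flags set, as required for Talos, only the MIG step remains; with neither set, all four steps are planned.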

This PR includes nil-checks for the systemd manager that were not present before. In the original code, these checks are effectively unnecessary because this member is always initialized and the entire program blocks on this initialization if systemd is not present.
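A nil-check of this kind might look like the sketch below. The `serviceManager` interface, the `reconfigurer` type, and `fakeManager` are hypothetical stand-ins and do not reflect the real types in the MIG manager.

```go
package main

import "fmt"

// serviceManager is a hypothetical stand-in for the systemd manager.
type serviceManager interface {
	StopUnit(name string) error
}

// reconfigurer mirrors the idea of an object whose systemd member may be nil.
type reconfigurer struct {
	systemd serviceManager // nil when systemd is not in use
}

// stopHostServices stops the named units and reports how many were stopped.
// With a nil manager it is a no-op instead of a nil-pointer panic.
func (r *reconfigurer) stopHostServices(units []string) (int, error) {
	if r.systemd == nil {
		return 0, nil
	}
	stopped := 0
	for _, u := range units {
		if err := r.systemd.StopUnit(u); err != nil {
			return stopped, fmt.Errorf("stopping %s: %w", u, err)
		}
		stopped++
	}
	return stopped, nil
}

// fakeManager records the units it was asked to stop (illustration only).
type fakeManager struct{ stoppedUnits []string }

func (f *fakeManager) StopUnit(name string) error {
	f.stoppedUnits = append(f.stoppedUnits, name)
	return nil
}
```

One design caveat of this pattern: the field should be left as a nil interface value, because an interface holding a nil pointer does not compare equal to nil.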

Improvements

This is the simplest possible solution to the problem described in the linked issue. All the MIG- and GPU-related operations work fine on Talos (see footnote 1). We simply need to skip over the parts that can't work on Talos. The obvious alternatives or improvements over this PR are:

  1. Specifically catching the "read-only FS" error when attempting to copy the binary instead of requiring a flag to skip the operation entirely
  2. Detecting the presence or absence of systemd, either by inspecting the running processes or by introducing a timeout on the DBus connection, and adjusting accordingly, instead of requiring a flag
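The first alternative could be sketched as below, assuming the copy surfaces the operating system's EROFS error through Go's standard error wrapping. The function names `isReadOnlyFS` and `copyToHost` are hypothetical; `errors.Is`, `os.WriteFile`, and `syscall.EROFS` are standard library.

```go
package main

import (
	"errors"
	"os"
	"syscall"
)

// isReadOnlyFS reports whether err, possibly wrapped (for example in an
// *os.PathError), is the "read-only file system" error EROFS.
func isReadOnlyFS(err error) bool {
	return errors.Is(err, syscall.EROFS)
}

// copyToHost tries to write data to path. On a read-only filesystem it
// reports skipped=true instead of failing; other errors are returned as-is.
func copyToHost(path string, data []byte) (skipped bool, err error) {
	err = os.WriteFile(path, data, 0o755)
	if isReadOnlyFS(err) {
		return true, nil
	}
	return false, err
}
```

A caller would log a warning when `skipped` is true and carry on with the MIG operations.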

Caveats

This PR exists more for discussion than with the goal of being merged. These changes were made based on a very cursory reading and superficial understanding of the MIG manager. There is likely a cleaner and more elegant way to achieve this. However, I'll submit the patch as a proof-of-concept: by disabling the host-copies and all systemd features, the MIG manager works properly on Talos. This is, at least for us, an important starting point.

Footnotes

  1. CUDA validation yields "ERROR: init 250 result=11" messages. I haven't yet figured out what this means, but so far it hasn't impacted the use of the GPU. The GPU workloads still run fine, as does the CUDA validation pod.

copy-pr-bot Bot commented May 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@linkages

Just want to add my feedback to this. I just tested this with the following setup:

Environment:

  • OS: Talos Linux v1.13.0
  • Kubernetes: v1.34.0
  • GPUs: NVIDIA B300 NVL and RTX 6000 Pro Server
  • NVIDIA driver/toolkit: provided by Talos system extensions (580.159.03)
  • Kernel Version: 6.18.29-talos
  • GPU Operator: Helm chart v26.3.1

I had to build a new k8s-mig-manager image based on @Arc676's repo, which I then pushed to docker.io/linkages/k8s-mig-manager:v0.14.1.

When I deployed the gpu-operator, I set the following values for the Helm chart:

driver:
  enabled: false

toolkit:
  enabled: false

hostPaths:
  driverInstallDir: /usr/local

mig:
  strategy: mixed

migManager:
  enabled: true
  repository: docker.io/linkages
  version: v0.14.1
  env:
    - name: READONLY_ROOTFS
      value: "true"
    - name: SYSTEMD_UNAVAILABLE
      value: "true"

operator:
  cleanupCRD: true
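The two `env` entries above are plain strings. How the patched image consumes them is not shown in this thread; the sketch below only assumes they are parsed as booleans, and `boolFromEnv` is a hypothetical helper, not a function from the MIG manager.

```go
package main

import (
	"fmt"
	"strconv"
)

// boolFromEnv interprets an environment variable's value as a boolean.
// set is false when the variable is absent, in which case def is returned.
func boolFromEnv(name, value string, set bool, def bool) (bool, error) {
	if !set {
		return def, nil
	}
	b, err := strconv.ParseBool(value)
	if err != nil {
		return def, fmt.Errorf("%s=%q is not a boolean: %w", name, value, err)
	}
	return b, nil
}
```

In a real program, `value` and `set` would typically come from `os.LookupEnv("READONLY_ROOTFS")` or `os.LookupEnv("SYSTEMD_UNAVAILABLE")`.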

I then set the nvidia.com/mig.config label on all nodes to all-balanced. The mig-manager did the right thing: it waited for all the operator components to stop, adjusted the MIG settings, and started everything back up. Shortly afterwards, the gpu-feature-discovery controller set the correct labels on the nodes.

This was tested on 2 different types of nodes in the same cluster:

  • 2 x Lenovo nodes with 8 x NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs
  • 1 x DGX B300 with 8 x NVIDIA B300 SXM6 AC

Thank you @Arc676 for this patch. I hope it or a more elegant version of this gets pulled upstream. For now this solves my problem.

@rajathagasthya
Contributor

Thanks @Arc676 for tackling this!

My preference is to replace the flag-based approach with something that adds no new public surface area, because WITH_SHUTDOWN_HOST_GPU_CLIENTS=false already encodes the intent "don't touch the host". The current bug is that reconfigure.New() unconditionally connects to systemd even when WITH_SHUTDOWN_HOST_GPU_CLIENTS=false.

We can document that WITH_SHUTDOWN_HOST_GPU_CLIENTS=false is required on platforms without systemd. To fully close the loop on getting this to work on Talos, I would also suggest creating an issue to decouple WITH_SHUTDOWN_HOST_GPU_CLIENTS from IS_HOST_DRIVER in the GPU Operator Helm chart.

In the meantime, we should remove the unconditional systemd.NewManager() call from reconfigure.New(), and replace it with lazy initialization and a context timeout. This way, when systemd isn't available, the MIG manager fails fast with a clear error instead of hanging silently.
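Lazy initialization with a context timeout could look roughly like the sketch below. It is not the MIG manager's code: `conn`, `dialFunc`, `lazyManager`, `hangingDial`, and `failsFast` are hypothetical names, and `dial` stands in for whatever call actually opens the systemd connection.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// conn is a hypothetical stand-in for a systemd/DBus connection.
type conn struct{}

// dialFunc opens a connection and must honor cancellation of ctx.
type dialFunc func(ctx context.Context) (*conn, error)

// lazyManager connects on first use only, and at most once.
type lazyManager struct {
	dial    dialFunc
	timeout time.Duration

	once sync.Once
	c    *conn
	err  error
}

// get returns the connection, dialing on the first call with a bounded wait.
func (m *lazyManager) get(ctx context.Context) (*conn, error) {
	m.once.Do(func() {
		ctx, cancel := context.WithTimeout(ctx, m.timeout)
		defer cancel()
		m.c, m.err = m.dial(ctx)
		if m.err != nil {
			m.err = fmt.Errorf("connecting to systemd (is it running on this host?): %w", m.err)
		}
	})
	return m.c, m.err
}

// hangingDial simulates a host without systemd: it blocks until ctx ends.
func hangingDial(ctx context.Context) (*conn, error) {
	<-ctx.Done()
	return nil, ctx.Err()
}

// failsFast reports whether a lazyManager over hangingDial gives up with a
// deadline error instead of blocking forever.
func failsFast(timeoutMillis int) bool {
	m := &lazyManager{dial: hangingDial, timeout: time.Duration(timeoutMillis) * time.Millisecond}
	_, err := m.get(context.Background())
	return errors.Is(err, context.DeadlineExceeded)
}
```

Because `sync.Once` caches the outcome, a failed first attempt is returned on every later call. In this sketch a host without systemd therefore pays the timeout once, and a setup with WITH_SHUTDOWN_HOST_GPU_CLIENTS=false that never calls `get` never pays it at all.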

Author

Arc676 commented May 28, 2026

Thanks for the feedback! I agree with your point regarding the public surface; adding flags was the easiest approach but both properties can be inferred from the system's behavior. Instead of requiring the user to set these flags, nvidia-mig-manager can set them automatically.

> In the meantime, we should remove the unconditional systemd.NewManager() call from reconfigure.New(), and replace it with lazy initialization and a context timeout. This way, when systemd isn't available, the MIG manager fails fast with a clear error instead of hanging silently.

Support for systems without systemd requires the option to disable the connection to systemd entirely. I'm not sure I understand what exactly you have in mind here.

What would be the advantage of a lazy init for the systemd manager? I suppose the startup would be slightly faster by skipping a step if the desired MIG configuration is already applied, but I think that determining the availability of systemd straight away makes more sense. In strictly managed environments, startup occurs at known times when the GPU operator is updated. Deferring the systemd check to when it's needed would mean that the latency would be incurred when the user attempts to change the MIG configuration. The first repartitioning operation after startup would be slower than the others.

We could keep the unconditional systemd.NewManager but introduce the timeout as you mentioned; if the connection times out, then we would set a flag to indicate that systemd is unavailable. In particular, the MIG manager should not fail in this case, but perhaps output a warning to ensure that the user is aware. Or did you mean that WITH_SHUTDOWN_HOST_GPU_CLIENTS=false should be equivalent to "systemd unavailable"?

Unless you want to separate the features, I'd implement all these changes in this PR such that Talos support is covered.

I've created a new issue to track the change to the Helm chart per your suggestion.

Arc676 added 2 commits May 29, 2026 13:17

  • Assume that host client shutdown flag reflects readonly FS
    Signed-off-by: Alessandro Vinciguerra <alessandro.vinciguerra@postfinance.ch>
  • Add timeout for systemd connection
    Autodetect systemd availability
    Signed-off-by: Alessandro Vinciguerra <alessandro.vinciguerra@postfinance.ch>

Author

Arc676 commented May 29, 2026

I've adapted the implementation based on the above comments:

  • Assume the Helm chart will be adapted such that WITH_SHUTDOWN_HOST_GPU_CLIENTS correctly reflects the intent "do not touch the host"; I've removed the flag for readonly hosts
  • Since the Reconfigure object is recreated each time, persistence attempts that fail because of a read-only FS now issue a warning instead of returning an error. Caching this finding would have to occur at the program's top level; I suppose this could be added, but with the above point there are no attempts to write to the host at this level.
  • The DBus connection uses the same Context throughout; I didn't find a way to set a timeout on just the initial connection. When the context times out, DBus closes the network connection and issues a corresponding warning that is not wrapped with context.DeadlineExceeded. It's not particularly elegant, but as a workaround I changed the initialization function to try twice: once with a timeout, after which systemd is flagged as unavailable (at least for the current reconfiguration attempt), and a second time with the parent context. As before, we'd need to query systemd outside the Reconfigure object to be able to cache the result. I've left in a flag to change the timeout, primarily to avoid having a fixed constant in the code. However, this does mean that every reconfiguration will have to wait for this timeout.
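The two-attempt workaround described in the last bullet can be sketched as follows. All names here (`conn`, `dialFunc`, `connectSystemd`, the simulated dial functions) are hypothetical and the commit's actual code may differ. One assumption is added on top of the description: instead of relying on the dial error wrapping context.DeadlineExceeded, the sketch inspects the probe context itself.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// conn is a hypothetical stand-in for a DBus connection to systemd.
type conn struct{}

// Close releases the connection (a no-op in this sketch).
func (c *conn) Close() {}

// dialFunc opens a connection whose lifetime is bound to ctx.
type dialFunc func(ctx context.Context) (*conn, error)

// connectSystemd probes with a short-lived context first. If the probe times
// out, systemd is reported as unavailable and no error is returned. If the
// probe succeeds, it is discarded and a second connection is opened on the
// parent context so that it outlives the probe's deadline.
func connectSystemd(parent context.Context, timeout time.Duration, dial dialFunc) (c *conn, available bool, err error) {
	probeCtx, cancel := context.WithTimeout(parent, timeout)
	probe, err := dial(probeCtx)
	// Check the context before cancelling it: the dial error itself may not
	// wrap context.DeadlineExceeded.
	timedOut := errors.Is(probeCtx.Err(), context.DeadlineExceeded)
	cancel()
	if err != nil {
		if timedOut {
			return nil, false, nil
		}
		return nil, false, err
	}
	probe.Close()

	c, err = dial(parent)
	if err != nil {
		return nil, true, err
	}
	return c, true, nil
}

// hangingDial simulates a host without systemd: nothing ever answers, and the
// resulting error deliberately does not wrap ctx.Err().
func hangingDial(ctx context.Context) (*conn, error) {
	<-ctx.Done()
	return nil, errors.New("connection closed")
}

// instantDial simulates a host where systemd answers immediately.
func instantDial(ctx context.Context) (*conn, error) { return &conn{}, nil }

// systemdAvailable runs connectSystemd against one of the simulated hosts.
func systemdAvailable(hostHasSystemd bool, timeoutMillis int) (bool, error) {
	dial := dialFunc(hangingDial)
	if hostHasSystemd {
		dial = instantDial
	}
	c, ok, err := connectSystemd(context.Background(), time.Duration(timeoutMillis)*time.Millisecond, dial)
	if c != nil {
		c.Close()
	}
	return ok, err
}
```

As noted in the comment, nothing here caches the result, so on a host without systemd every call waits for the full timeout.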

Development

Successfully merging this pull request may close these issues.

[Bug]: k8s-mig-manager v0.14.0 observes nvidia.com/mig.config label but does not apply geometry on Talos