SCHED-1049: add slurmd workaround for topology.conf#2235
SCHED-1049: add slurmd workaround for topology.conf#2235Uburro wants to merge 1 commit intosoperator-release-3.0from
Conversation
2ed2aa6 to
158799c
Compare
There was a problem hiding this comment.
Pull request overview
This PR introduces a per-pod topology.conf generation + mount flow intended to work around slurmd issues when topology is enabled, by writing a pod-specific topology.conf in an init container and making it available to the slurmd container.
Changes:
- Add an
EmptyDirvolume intended to carry a dynamically generatedtopology.conf. - Update worker init container to mount the new volume and export
DYNAMIC_TOPOLOGY_PATH. - Extend
worker_init.pyto generate and write a minimal per-podtopology.confduringwait-topology.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/render/worker/statefulset_test.go | Updates init-container expectations to account for new env var + volume mount. |
| internal/render/worker/statefulset.go | Adds a new EmptyDir volume when topology plugin is enabled. |
| internal/render/worker/container.go | Mounts the new dynamic-topology volume into both init and slurmd containers; adds DYNAMIC_TOPOLOGY_PATH env var. |
| internal/consts/volume.go | Introduces constants for the dynamic topology volume name and mount paths. |
| images/worker/worker_init.py | Adds topology.conf generation + writing logic and wires it into wait_for_topology(). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| assert.Equal(t, 11, len(result.Env)) // 6 base + 1 NODESET_GPU_ENABLED + 3 topology + 1 DYNAMIC_TOPOLOGY_PATH | ||
| assert.Equal(t, 4, len(result.VolumeMounts)) | ||
|
|
||
| expectedMounts := map[string]string{ | ||
| consts.VolumeNameJail: consts.VolumeMountPathJail, | ||
| consts.VolumeNameMungeSocket: consts.VolumeMountPathMungeSocket, | ||
| consts.VolumeNameTopologyNodeLabels: consts.VolumeMountPathTopologyNodeLabels, | ||
| consts.VolumeNameDynamicTopology: consts.VolumeMountPathDynamicTopologyInit, |
There was a problem hiding this comment.
This test update only checks the total Env/VolumeMount counts; it would be more robust to also assert the presence and value of the new DYNAMIC_TOPOLOGY_PATH env var (and the expected mount path) to avoid false positives if other env vars/mounts are added/removed in the future.
| volumeMounts = append(volumeMounts, | ||
| corev1.VolumeMount{ | ||
| Name: consts.VolumeNameTopologyNodeLabels, | ||
| MountPath: consts.VolumeMountPathTopologyNodeLabels, | ||
| ReadOnly: true, | ||
| }, | ||
| corev1.VolumeMount{ | ||
| Name: consts.VolumeNameDynamicTopology, | ||
| MountPath: consts.VolumeMountPathDynamicTopology, | ||
| }, |
There was a problem hiding this comment.
Mounting the EmptyDir volume at /etc/slurm (via VolumeMountPathDynamicTopology) will shadow the existing Slurm config directory and can break the container startup logic that symlinks /etc/slurm to the jail (and/or make slurmd miss slurm.conf, etc.). Consider mounting the EmptyDir as a single file using subPath (e.g., mount topology.conf into /etc/slurm/topology.conf) or mounting into a non-conflicting directory and bind-mount/copy the file into the final location during init/entrypoint.
internal/consts/volume.go
Outdated
|
|
||
| VolumeNameDynamicTopology = "dynamic-slurm-topology" | ||
| VolumeMountPathDynamicTopology = "/etc/slurm" | ||
| VolumeMountPathDynamicTopologyInit = "/mnt/dynamic-topology" |
There was a problem hiding this comment.
VolumeMountPathDynamicTopology is set to /etc/slurm, which is the full Slurm config directory. Using this as a mount target for an EmptyDir risks hiding the real Slurm configs. If the intent is only to provide topology.conf, consider changing this constant to a file path (e.g. /etc/slurm/topology.conf) and mounting the EmptyDir file with subPath, or otherwise avoid mounting over the whole directory.
Problem
Solution
Testing
Release Notes