Skip to content

SCHED-1049: add slurmd workaround for topology.conf#2235

Draft
Uburro wants to merge 1 commit intosoperator-release-3.0from
SCHED-1049/0
Draft

SCHED-1049: add slurmd workaround for topology.conf#2235
Uburro wants to merge 1 commit intosoperator-release-3.0from
SCHED-1049/0

Conversation

@Uburro
Copy link
Collaborator

@Uburro Uburro commented Feb 27, 2026

Problem

Solution

Testing

Release Notes

@Uburro Uburro added bug Something isn't working fix labels Feb 27, 2026
@Uburro Uburro force-pushed the SCHED-1049/0 branch 2 times, most recently from 2ed2aa6 to 158799c Compare March 2, 2026 20:58
@Uburro Uburro requested a review from Copilot March 3, 2026 09:08
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a per-pod topology.conf generation + mount flow intended to work around slurmd issues when topology is enabled, by writing a pod-specific topology.conf in an init container and making it available to the slurmd container.

Changes:

  • Add an EmptyDir volume intended to carry a dynamically generated topology.conf.
  • Update worker init container to mount the new volume and export DYNAMIC_TOPOLOGY_PATH.
  • Extend worker_init.py to generate and write a minimal per-pod topology.conf during wait-topology.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
internal/render/worker/statefulset_test.go Updates init-container expectations to account for new env var + volume mount.
internal/render/worker/statefulset.go Adds a new EmptyDir volume when topology plugin is enabled.
internal/render/worker/container.go Mounts the new dynamic-topology volume into both init and slurmd containers; adds DYNAMIC_TOPOLOGY_PATH env var.
internal/consts/volume.go Introduces constants for the dynamic topology volume name and mount paths.
images/worker/worker_init.py Adds topology.conf generation + writing logic and wires it into wait_for_topology().

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +34 to +41
assert.Equal(t, 11, len(result.Env)) // 6 base + 1 NODESET_GPU_ENABLED + 3 topology + 1 DYNAMIC_TOPOLOGY_PATH
assert.Equal(t, 4, len(result.VolumeMounts))

expectedMounts := map[string]string{
consts.VolumeNameJail: consts.VolumeMountPathJail,
consts.VolumeNameMungeSocket: consts.VolumeMountPathMungeSocket,
consts.VolumeNameTopologyNodeLabels: consts.VolumeMountPathTopologyNodeLabels,
consts.VolumeNameDynamicTopology: consts.VolumeMountPathDynamicTopologyInit,
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test update only checks the total Env/VolumeMount counts; it would be more robust to also assert the presence and value of the new DYNAMIC_TOPOLOGY_PATH env var (and the expected mount path) to avoid false positives if other env vars/mounts are added/removed in the future.

Copilot uses AI. Check for mistakes.
Comment on lines +167 to +176
volumeMounts = append(volumeMounts,
corev1.VolumeMount{
Name: consts.VolumeNameTopologyNodeLabels,
MountPath: consts.VolumeMountPathTopologyNodeLabels,
ReadOnly: true,
},
corev1.VolumeMount{
Name: consts.VolumeNameDynamicTopology,
MountPath: consts.VolumeMountPathDynamicTopology,
},
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mounting the EmptyDir volume at /etc/slurm (via VolumeMountPathDynamicTopology) will shadow the existing Slurm config directory and can break the container startup logic that symlinks /etc/slurm to the jail (and/or make slurmd miss slurm.conf, etc.). Consider mounting the EmptyDir as a single file using subPath (e.g., mount topology.conf into /etc/slurm/topology.conf) or mounting into a non-conflicting directory and bind-mount/copy the file into the final location during init/entrypoint.

Copilot uses AI. Check for mistakes.
Comment on lines +89 to +92

VolumeNameDynamicTopology = "dynamic-slurm-topology"
VolumeMountPathDynamicTopology = "/etc/slurm"
VolumeMountPathDynamicTopologyInit = "/mnt/dynamic-topology"
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VolumeMountPathDynamicTopology is set to /etc/slurm, which is the full Slurm config directory. Using this as a mount target for an EmptyDir risks hiding the real Slurm configs. If the intent is only to provide topology.conf, consider changing this constant to a file path (e.g. /etc/slurm/topology.conf) and mounting the EmptyDir file with subPath, or otherwise avoid mounting over the whole directory.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants