SCHED-1049: rollback topology controller and fix dynamic topology bug #2263
Uburro wants to merge 1 commit into `soperator-release-3.0`
Conversation
Pull request overview
This PR refactors the worker topology controller in a Slurm operator, introducing a graph-based topology configuration builder (TopologyGraph) and a block-based topology builder (TopologyBlocks). It replaces the previous BuildNodeSetTopologyConf function (which used NodeSet-based range notation) with dynamic pod-based topology construction that groups running worker pods by their actual node assignments and topology labels.
Changes:
- Added `TopologyGraph` (tree topology) and `TopologyBlocks` (block topology) builders that construct Slurm topology configuration from node labels and pod assignments, filtering out nodes without running pods.
- Refactored `Reconcile` to fetch the existing topology config from a separate ConfigMap (`topology-config`) for hash comparison, and added `fetchConfigMap` as a reusable generic method.
- Replaced the static NodeSet-based topology generation with dynamic pod collection (`collectRunningWorkerPods`) and node grouping (`GroupPodNamesByNode`).
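To illustrate the pod-grouping step described above, here is a minimal sketch. The `Pod` type and `groupPodNamesByNode` helper are hypothetical stand-ins (the real controller operates on `corev1.Pod` objects and its helper is named `GroupPodNamesByNode`); the sketch only shows the filtering-and-indexing idea:

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a minimal stand-in for the Kubernetes Pod fields the controller
// cares about (illustrative; not the operator's actual types).
type Pod struct {
	Name     string
	NodeName string
	Running  bool
}

// groupPodNamesByNode keeps only running pods that are scheduled to a
// node and indexes their names by node name, sorted for determinism.
func groupPodNamesByNode(pods []Pod) map[string][]string {
	byNode := make(map[string][]string)
	for _, p := range pods {
		if !p.Running || p.NodeName == "" {
			continue // skip non-running and unscheduled pods
		}
		byNode[p.NodeName] = append(byNode[p.NodeName], p.Name)
	}
	for _, names := range byNode {
		sort.Strings(names) // deterministic ordering for config rendering
	}
	return byNode
}

func main() {
	pods := []Pod{
		{Name: "worker-0", NodeName: "node-a", Running: true},
		{Name: "worker-1", NodeName: "node-a", Running: true},
		{Name: "worker-2", NodeName: "node-b", Running: true},
		{Name: "worker-3", NodeName: "", Running: false}, // pending, dropped
	}
	fmt.Println(groupPodNamesByNode(pods))
}
```

Filtering out nodes without running pods at this stage is what keeps the generated topology in sync with the cluster's actual state.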
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `internal/controller/topologyconfcontroller/topology_graph.go` | New file: implements `TopologyGraph` with tree-building, single-root enforcement, and Slurm config rendering |
| `internal/controller/topologyconfcontroller/topology_graph_test.go` | New file: comprehensive tests for graph-based topology rendering across various tier configurations |
| `internal/controller/topologyconfcontroller/topology_blocks.go` | New file: implements `TopologyBlocks` for block-based topology grouping by tier-0 labels |
| `internal/controller/topologyconfcontroller/topology_blocks_test.go` | New file: tests for block topology grouping, unknown block assignment, and empty input |
| `internal/controller/topologyconfcontroller/workertopology_controller.go` | Refactored reconciliation: dynamic pod collection, generic `fetchConfigMap`, removed old `BuildNodeSetTopologyConf` and `formatNodeRange` |
| `internal/controller/topologyconfcontroller/workertopology_controller_test.go` | Updated test to reflect `GetPodsByNode` → `GroupPodNamesByNode` rename |
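The tree-rendering part of `topology_graph.go` can be sketched roughly as follows. This is an assumption-laden illustration, not the operator's actual API: the function name `renderTreeTopology`, the node→switch map input, and the single hard-coded `root` switch are all hypothetical, chosen only to show the single-root enforcement and Slurm config rendering the summary mentions:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderTreeTopology renders a Slurm tree topology.conf from a mapping
// of worker node names to their leaf-switch labels, hanging every leaf
// switch off a single enforced root.
func renderTreeTopology(nodeToSwitch map[string]string) string {
	switchNodes := make(map[string][]string)
	for node, sw := range nodeToSwitch {
		switchNodes[sw] = append(switchNodes[sw], node)
	}
	switches := make([]string, 0, len(switchNodes))
	for sw := range switchNodes {
		switches = append(switches, sw)
	}
	sort.Strings(switches) // deterministic output

	var b strings.Builder
	// Single root switch connecting every leaf switch.
	fmt.Fprintf(&b, "SwitchName=root Switches=%s\n", strings.Join(switches, ","))
	for _, sw := range switches {
		nodes := switchNodes[sw]
		sort.Strings(nodes)
		fmt.Fprintf(&b, "SwitchName=%s Nodes=%s\n", sw, strings.Join(nodes, ","))
	}
	return b.String()
}

func main() {
	fmt.Print(renderTreeTopology(map[string]string{
		"worker-0": "leaf-1",
		"worker-1": "leaf-1",
		"worker-2": "leaf-2",
	}))
}
```

Sorting switches and nodes before rendering keeps the output stable, which matters because the controller compares the rendered config against the existing ConfigMap by hash.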
Force-pushed from `20412da` to `9a69b84`.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.
Force-pushed from `1bd30f4` to `99836f5`.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Problem
After migrating soperator to dynamic topology, we found a scheduling regression: without `topology.conf`, job scheduling behaved incorrectly, specifically producing invalid `SLURM_TOPOLOGY_ADDR` values. During investigation and Slurm experiments, we confirmed that topology behavior depends on local `slurmd` state, and `slurmd` initializes this state from `topology.conf`. As a result, dynamic-only topology was not sufficient for stable scheduling.
Solution
We restored `topology.conf` generation while keeping dynamic topology support, and refactored the controller logic to make this flow reliable:
- `TopologyGraph` builds the tree topology.
- `TopologyBlocks` builds the block topology.
- The rendered config is delivered via `ConfigMap` + `JailedConfig`.
- `/etc/slurm/topology.conf` is in place before applying node topology.
Testing
- `go test ./internal/controller/topologyconfcontroller/...`
- `make lint` (passes; cache warnings may appear in sandboxed environments)
- (`worker_init.py`)
Release Notes
- Fixed invalid `SLURM_TOPOLOGY_ADDR` values during job scheduling.
- Restored `topology.conf` as the authoritative input for `slurmd` topology state.
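For reference, a generated tree-style `topology.conf` of the kind described above might look like the following (switch and node names are hypothetical, chosen only to show the format):

```
SwitchName=root Switches=leaf-1,leaf-2
SwitchName=leaf-1 Nodes=worker-0,worker-1
SwitchName=leaf-2 Nodes=worker-2
```

With this file present, `slurmd` can initialize its local topology state consistently, which is what restores correct `SLURM_TOPOLOGY_ADDR` values.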