Skip to content

SCHED-1049: rallback topology controller and fixed bug with dynamic topology#2263

Open
Uburro wants to merge 1 commit intosoperator-release-3.0from
SCHED-1049/2
Open

SCHED-1049: rallback topology controller and fixed bug with dynamic topology#2263
Uburro wants to merge 1 commit intosoperator-release-3.0from
SCHED-1049/2

Conversation

@Uburro
Copy link
Collaborator

@Uburro Uburro commented Mar 5, 2026

Problem

After migrating soperator to dynamic topology, we found a scheduling regression: without topology.conf, job scheduling behaved incorrectly, specifically with invalid SLURM_TOPOLOGY_ADDR values.

During investigation and Slurm experiments, we confirmed that topology behavior depends on local slurmd state, and slurmd initializes this state from topology.conf. As a result, dynamic-only topology was not sufficient for stable scheduling.

Solution

We restored topology.conf generation while keeping dynamic topology support, and refactored the controller logic to make this flow reliable:

  • Reintroduced/kept topology config generation in the worker topology controller.
  • Rolled back structured topology builders:
    • tree mode via TopologyGraph
    • block mode via TopologyBlocks
  • Ensured generated topology includes a stable root-switch model and unknown-node fallback.
  • Refactored reconciliation flow:
    • ensure topology resources exist (ConfigMap + JailedConfig)
    • collect running worker pods
    • build desired topology config
    • compare hash and update only when changed
  • Updated worker init behavior to wait until hostname is present in /etc/slurm/topology.conf before applying node topology.

Testing

  • go test ./internal/controller/topologyconfcontroller/...
  • make lint (passes; cache warnings may appear in sandboxed environments)
  • Updated/added unit tests for:
    • topology graph/block rendering
    • worker pod collection logic
    • worker init topology wait behavior (worker_init.py)
    • manually

Release Notes

  • Fixed dynamic-topology scheduling regression related to invalid SLURM_TOPOLOGY_ADDR.
  • Restored generation/use of topology.conf as the authoritative input for slurmd topology state.
  • Improved topology controller architecture and test coverage for tree/block/fallback cases.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the worker topology controller in a Slurm operator, introducing a graph-based topology configuration builder (TopologyGraph) and a block-based topology builder (TopologyBlocks). It replaces the previous BuildNodeSetTopologyConf function (which used NodeSet-based range notation) with dynamic pod-based topology construction that groups running worker pods by their actual node assignments and topology labels.

Changes:

  • Added TopologyGraph (tree topology) and TopologyBlocks (block topology) builders that construct Slurm topology configuration from node labels and pod assignments, filtering out nodes without running pods.
  • Refactored Reconcile to fetch the existing topology config from a separate ConfigMap (topology-config) for hash comparison, and added fetchConfigMap as a reusable generic method.
  • Replaced the static NodeSet-based topology generation with dynamic pod collection (collectRunningWorkerPods) and node grouping (GroupPodNamesByNode).

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
internal/controller/topologyconfcontroller/topology_graph.go New file: implements TopologyGraph with tree-building, single-root enforcement, and Slurm config rendering
internal/controller/topologyconfcontroller/topology_graph_test.go New file: comprehensive tests for graph-based topology rendering across various tier configurations
internal/controller/topologyconfcontroller/topology_blocks.go New file: implements TopologyBlocks for block-based topology grouping by tier-0 labels
internal/controller/topologyconfcontroller/topology_blocks_test.go New file: tests for block topology grouping, unknown block assignment, and empty input
internal/controller/topologyconfcontroller/workertopology_controller.go Refactored reconciliation: dynamic pod collection, generic fetchConfigMap, removed old BuildNodeSetTopologyConf and formatNodeRange
internal/controller/topologyconfcontroller/workertopology_controller_test.go Updated test to reflect GetPodsByNodeGroupPodNamesByNode rename

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Uburro Uburro force-pushed the SCHED-1049/2 branch 2 times, most recently from 20412da to 9a69b84 Compare March 6, 2026 12:20
@Uburro Uburro requested a review from Copilot March 6, 2026 12:21
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Uburro Uburro force-pushed the SCHED-1049/2 branch 2 times, most recently from 1bd30f4 to 99836f5 Compare March 6, 2026 12:39
@Uburro Uburro requested a review from Copilot March 6, 2026 12:39
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Uburro Uburro changed the title SCHED-1049: rallback topology controller and refactor some logic SCHED-1049: rallback topology controller and fixed bug with dynamic topology Mar 6, 2026
@Uburro Uburro marked this pull request as ready for review March 6, 2026 14:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants