Allow multiple SlurmClusters per namespace; make topology ConfigMaps cluster-scoped #2126

@Uburro

Description

With the move from name‑bound worker StatefulSets to NodeSets, the old implicit “one cluster per namespace” assumption is no longer valid. The operator still relies on it in two places: GetClusterInNamespace rejects a namespace that contains more than one SlurmCluster, and topology ConfigMap generation does not distinguish between clusters. We need to remove the single‑cluster restriction and generate topology ConfigMaps per cluster, labeled for ownership.

Problem

Historically, before NodeSets, there was a name binding between worker StatefulSets and the cluster name, which effectively limited multi‑cluster usage in the same namespace. In newer versions, NodeSets remove that name coupling, so multiple clusters in one namespace should be supported.

However, cluster.go hard‑fails when more than one SlurmCluster exists in a namespace:

if len(clusterList.Items) > 1 {
    err := fmt.Errorf("multiple %s resources found in namespace %q", slurmv1.KindSlurmCluster, namespace)
    logger.Error(err, fmt.Sprintf("%d %s resources found in namespace %q. This should not happen and definitely is a bug", len(clusterList.Items), slurmv1.KindSlurmCluster, namespace))
    return nil, err
}

This prevents multi‑cluster deployments in the same namespace. Also, topology ConfigMap generation is not cluster‑scoped, so it can collide or be ambiguous once multiple clusters exist.
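As a rough illustration of what relaxing this check could look like, here is a minimal sketch that picks one cluster by name instead of failing when several exist. It assumes the caller already knows which cluster it wants (for example through a reference on the NodeSet, which is an assumption and not something the current code is known to provide). `SlurmCluster` and `findCluster` are hypothetical stand-ins, not the operator's real types or functions.

```go
package main

import "fmt"

// SlurmCluster is a hypothetical stand-in for slurmv1.SlurmCluster,
// reduced to the two fields this sketch needs.
type SlurmCluster struct {
	Namespace string
	Name      string
}

// findCluster is a hypothetical alternative to the len(...) > 1 check:
// it returns the cluster with the requested name and tolerates any
// number of sibling clusters in the same namespace.
func findCluster(items []SlurmCluster, namespace, name string) (*SlurmCluster, error) {
	for i := range items {
		if items[i].Namespace == namespace && items[i].Name == name {
			return &items[i], nil
		}
	}
	return nil, fmt.Errorf("SlurmCluster %q not found in namespace %q", name, namespace)
}
```

With this shape, a missing cluster is still an error, but the mere presence of a second cluster is not.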

Proposed change

  • Remove the single‑cluster restriction in GetClusterInNamespace so that multiple SlurmCluster objects can exist in one namespace.
  • Generate topology ConfigMaps per cluster, with cluster name in the resource name (or another unique identifier).
  • Add labels to topology ConfigMaps to indicate cluster ownership (e.g., slurm.nebius.ai/cluster=), so it’s clear which cluster a ConfigMap belongs to.
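The naming and labeling points above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `-topology` suffix, the `ConfigMapMeta` type, and the `topologyConfigMapMeta` helper are hypothetical, and using the cluster name as the label value is one possible reading of the label proposed above.

```go
package main

import "fmt"

// clusterLabelKey is the label key suggested in this issue.
const clusterLabelKey = "slurm.nebius.ai/cluster"

// ConfigMapMeta is a hypothetical, simplified stand-in for the metadata
// of a Kubernetes ConfigMap (namespace, name, labels).
type ConfigMapMeta struct {
	Namespace string
	Name      string
	Labels    map[string]string
}

// topologyConfigMapMeta builds per-cluster metadata so that two clusters
// in the same namespace get different ConfigMap names. The "-topology"
// suffix is an assumed naming convention, not the operator's actual one.
func topologyConfigMapMeta(namespace, clusterName string) ConfigMapMeta {
	return ConfigMapMeta{
		Namespace: namespace,
		Name:      fmt.Sprintf("%s-topology", clusterName),
		// Assumption: the label value is the owning cluster's name.
		Labels: map[string]string{clusterLabelKey: clusterName},
	}
}
```

One thing to keep in mind with this approach is that Kubernetes label values are limited to 63 characters, so a very long cluster name would need separate handling.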

Acceptance criteria

  • Multiple SlurmCluster resources can exist in the same namespace without operator errors.
  • Topology ConfigMaps are generated per cluster and do not collide.
  • Topology ConfigMaps carry labels identifying the owning cluster.

Context / Rationale

NodeSets removed the old name‑binding between worker StatefulSets and cluster name. The operator should reflect this change and allow multiple clusters per namespace while keeping topology resources unambiguous.

Files likely involved

  • cluster.go
  • nodetopology_controller.go
  • workertopology_controller.go (if it generates or references topology ConfigMaps)

If you want, I can also propose specific naming/labeling conventions for the ConfigMaps.
