Description
With the move from name‑bound worker StatefulSets to NodeSets, the old implicit “one cluster per namespace” assumption is no longer valid. The operator still enforces it in two places: GetClusterInNamespace returns an error when a namespace contains more than one SlurmCluster, and topology ConfigMap generation does not distinguish between clusters. We need to remove the single‑cluster restriction and generate topology ConfigMaps per cluster, labeled for ownership.
Problem
Historically, before NodeSets, there was a name binding between worker StatefulSets and the cluster name, which effectively limited multi‑cluster usage in the same namespace. In newer versions, NodeSets remove that name coupling, so multiple clusters in one namespace should be supported.
However, cluster.go hard‑fails when more than one SlurmCluster exists in a namespace:
```go
if len(clusterList.Items) > 1 {
	err := fmt.Errorf("multiple %s resources found in namespace %q", slurmv1.KindSlurmCluster, namespace)
	logger.Error(err, fmt.Sprintf("%d %s resources found in namespace %q. This should not happen and definitely is a bug", len(clusterList.Items), slurmv1.KindSlurmCluster, namespace))
	return nil, err
}
```
This prevents multi‑cluster deployments in the same namespace. Also, topology ConfigMap generation is not cluster‑scoped, so it can collide or be ambiguous once multiple clusters exist.
Proposed change
- Remove the single‑cluster restriction in GetClusterInNamespace so that multiple SlurmCluster objects can exist in one namespace.
- Generate topology ConfigMaps per cluster, with cluster name in the resource name (or another unique identifier).
- Add labels to topology ConfigMaps to indicate cluster ownership (e.g., slurm.nebius.ai/cluster=), so it’s clear which cluster a ConfigMap belongs to.
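The naming and labeling part of this proposal could look roughly like the sketch below. It is only an illustration: the `topology` prefix and both helper names are hypothetical and not taken from the operator's code; only the label key `slurm.nebius.ai/cluster` comes from the proposal above. Because a label value is limited to 63 characters in Kubernetes, this approach assumes cluster names fit within that limit.

```go
package main

import "fmt"

// clusterLabelKey is the ownership label key suggested in this issue.
const clusterLabelKey = "slurm.nebius.ai/cluster"

// topologyConfigMapPrefix is a hypothetical prefix chosen for illustration;
// the operator's real ConfigMap name may differ.
const topologyConfigMapPrefix = "topology"

// TopologyConfigMapName (hypothetical helper) derives a per-cluster name,
// so two clusters in one namespace never share a ConfigMap.
func TopologyConfigMapName(clusterName string) string {
	return topologyConfigMapPrefix + "-" + clusterName
}

// TopologyConfigMapLabels (hypothetical helper) returns the labels that
// mark which cluster owns the ConfigMap.
func TopologyConfigMapLabels(clusterName string) map[string]string {
	return map[string]string{clusterLabelKey: clusterName}
}

func main() {
	// Two clusters in the same namespace get distinct names and labels.
	for _, c := range []string{"cluster-a", "cluster-b"} {
		fmt.Println(TopologyConfigMapName(c), TopologyConfigMapLabels(c))
	}
}
```

With a fixed prefix, distinct cluster names always produce distinct ConfigMap names, and the label lets controllers select a cluster's topology ConfigMap without parsing its name.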
Acceptance criteria
- Multiple SlurmCluster resources can exist in the same namespace without operator errors.
- Topology ConfigMaps are generated per cluster and do not collide.
- Topology ConfigMaps carry labels identifying the owning cluster.
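To satisfy the first criterion, callers that currently rely on there being exactly one cluster would need to resolve a cluster by name instead. A minimal sketch of that lookup follows. `SlurmCluster` here is a simplified stand-in for the operator's `slurmv1.SlurmCluster` type, and `FindClusterByName` is a hypothetical helper. How a caller learns the cluster name (for example from a label or an owner reference) is an assumption that this issue does not settle.

```go
package main

import "fmt"

// SlurmCluster is a simplified stand-in for slurmv1.SlurmCluster,
// reduced to the fields this sketch needs.
type SlurmCluster struct {
	Name      string
	Namespace string
}

// FindClusterByName (hypothetical helper) returns the cluster with the
// given name. Unlike the current check, it tolerates any number of other
// clusters in the list and fails only when the requested one is missing.
func FindClusterByName(clusters []SlurmCluster, name string) (*SlurmCluster, error) {
	for i := range clusters {
		if clusters[i].Name == name {
			return &clusters[i], nil
		}
	}
	return nil, fmt.Errorf("SlurmCluster %q not found among %d clusters", name, len(clusters))
}

func main() {
	clusters := []SlurmCluster{
		{Name: "cluster-a", Namespace: "slurm"},
		{Name: "cluster-b", Namespace: "slurm"},
	}
	c, err := FindClusterByName(clusters, "cluster-b")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("found", c.Name, "in namespace", c.Namespace)
}
```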
Context / Rationale
NodeSets removed the old name‑binding between worker StatefulSets and cluster name. The operator should reflect this change and allow multiple clusters per namespace while keeping topology resources unambiguous.
Files likely involved
cluster.go
nodetopology_controller.go
workertopology_controller.go (if it generates or references topology ConfigMaps)
If you want, I can also propose specific naming/labeling conventions for the ConfigMaps.