feat(slinky-slurm): add h100-gke-cos leaf and inline platform refs#997
Conversation
Move slinky-slurm-operator-crds, slinky-slurm-operator, and slinky-slurm from the platform-slurm / platform-slurm-cluster mixins into inline componentRefs on each slurm leaf, mirroring the dynamo-platform pattern used by *-inference-dynamo leaves. The mixinComponentRefSafeForMerge guard in pkg/recipe/metadata_store.go blocks leaf-level overrides on identity fields (source, version, valuesFile) of mixin-contributed components, which made it impossible to declare leaf-specific GPU GRES tuning (extraConfMap.Gres and slurmd.resources.limits.nvidia.com/gpu) from the leaf overlays. Inlining the three Slinky charts per slurm leaf moves them off the mixin merge path and onto the standard componentRef overlay path, unblocking the GRES wiring that srun --gres=gpu:N requires. Changes: - recipes/overlays/h100-eks-ubuntu-training-slurm.yaml: inline the three Slinky componentRefs with H100 GRES overrides on slinky-slurm - recipes/overlays/h100-kind-training-slurm.yaml: inline the three Slinky componentRefs (CPU-only, no GRES overrides) - recipes/mixins/platform-slurm.yaml, recipes/mixins/platform-slurm-cluster.yaml: removed; no remaining references in the live tree (only a historical CHANGELOG mention) - recipes/registry.yaml: update slinky-slurm comment to point to the inline pattern; defaults still drive source/version/namespace - docs/integrator/recipe-development.md, docs/user/component-catalog.md: rewrite the slurm-platform section to describe the inline pattern - docs/user/api-reference.md: add slinky-slurm to the bundler component table (pre-existing gap from when the cluster was mixin-opt-in)
First-class GKE/COS leaf for H100 training with the Slinky-managed Slurm cluster (Controller, LoginSet, NodeSet, RestApi). Inherits from h100-gke-cos-training and inlines the slinky-slurm trio (slinky-slurm-operator-crds, slinky-slurm-operator, slinky-slurm) with the same H100 GRES overrides as the EKS slurm leaf, so srun --gres=gpu:N allocates real GPUs (Gres=gpu:h100:8 in slurm.conf, nvidia.com/gpu: 8 on the slurmd pod). Unlike the EKS slurm leaf, no os-* mixin is composed: gke-cos.yaml already disables GPU driver installation and pins DRA / nodewright paths for COS's read-only rootfs. Drops the K8s-native NCCL bandwidth perf check inherited from h100-gke-cos-training: on a Slinky cluster the equivalent validation is a slurm-launched srun job (e.g. nccl-tests/all_reduce_perf) that exercises slurmd + the GPUDirect TCPXO plugin from the parent leaf. Conformance and deployment checks are inherited unchanged. Adds a deployment-order guard test case mirroring the existing h100-kind-training-slurm and h100-eks-ubuntu-training-slurm cases, asserting CRDs precede the operator, the operator precedes the cluster, cert-manager precedes the operator, and gpu-operator precedes nvsentinel.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR replaces centralized Slurm mixins with inline per-leaf Slinky Slurm componentRefs across overlays. Existing EKS and Kind Slurm overlays are updated to declare CRDs, operator, and cluster inline with leaf-specific GRES/override settings and DRA constraints. A new GKE/COS/H100 training Slurm overlay is added. Documentation, registry comments, and the API components list are updated to reflect the inline-per-leaf pattern. Tests are extended to cover NFD topology updater expectations and deployment-order guards for the new overlays. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/integrator/recipe-development.md`:
- Line 118: Split the dense paragraph into shorter sentences or a bulleted list
to improve readability: explain each concept separately (1) that a platform's
full component stack is declared inline per leaf overlay rather than via a
platform mixin, (2) that leaves can carry hardware-specific tuning like GPU
GRES, resource limits, and partition layout, (3) the concrete example that
`--platform slurm` leaves inline the SchedMD Slinky operator CRDs and the
operator and the Slinky-managed Slurm cluster as three `componentRefs` (same
shape `dynamo-platform` uses across `*-inference-dynamo` leaves), (4) the
variations where a leaf can inline only `slinky-slurm-operator-crds` +
`slinky-slurm-operator` to get the operator only, or add `slinky-slurm` with
leaf-specific `overrides` to get the full cluster, and (5) reference the example
file `recipes/overlays/h100-eks-ubuntu-training-slurm.yaml`; convert each
numbered concept into its own short sentence or bullet so the paragraph is
scan-friendly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: ac0ba934-f154-4860-95ab-06efc451df91
📒 Files selected for processing (10)
docs/integrator/recipe-development.mddocs/user/api-reference.mddocs/user/component-catalog.mdpkg/recipe/deployment_order_guard_test.gorecipes/mixins/platform-slurm-cluster.yamlrecipes/mixins/platform-slurm.yamlrecipes/overlays/h100-eks-ubuntu-training-slurm.yamlrecipes/overlays/h100-gke-cos-training-slurm.yamlrecipes/overlays/h100-kind-training-slurm.yamlrecipes/registry.yaml
💤 Files with no reviewable changes (2)
- recipes/mixins/platform-slurm-cluster.yaml
- recipes/mixins/platform-slurm.yaml
The single dense paragraph introduced in the slinky-slurm refactor was hard to scan. Break it into a short lead sentence, a bulleted list of the three inlined componentRefs (CRDs / operator / cluster), and a trailing line that points at dynamo-platform as the precedent and h100-eks-ubuntu-training-slurm.yaml as the worked example. Drops the speculative "partition layout" tuning example and the operator-only leaf variation: neither is exercised by any leaf in tree today, and documenting hypotheticals risks the same drift problem catalog-md hit before (NKX-10404).
ArangoGutierrez
left a comment
There was a problem hiding this comment.
Approving — the refactor is well-motivated and well-documented.
Summary: The inline shape mirrors the established dynamo-platform precedent, the PR description traces the engine constraint (mixinComponentRefSafeForMerge identity-field guard) precisely, and the end-to-end evidence on aicr-eidosx (sinfo + 16 unique H100 UUIDs across both slurmd pods) is concrete. Nice catch on adding slinky-slurm to the docs/user/api-reference.md bundler table — that was a pre-existing gap from when it was opt-in via the deleted cluster mixin.
Verified locally at HEAD e51b4b16:
TestDeploymentOrderGuardspasses all 12 cases including the newh100-gke-cos-training-slurm.go test ./pkg/recipe/... -shortgreen.- All function references in the PR (
mergeMixins,mixinComponentRefSafeForMerge,ApplyRegistryDefaults) exist at the cited locations. - No orphaned consumers of the deleted mixins outside of stale planning docs in
docs/superpowers/.
One consistency question on the EKS vs GKE NCCL validation handling (inline), plus a non-blocking suggestion about extending TestNFDTopologyUpdater_OverlayCoverage to cover slurm leaves (pre-existing gap, not introduced by this PR).
The GKE slurm leaf already overrides validation.performance.checks to [] with the rationale that on a Slinky-managed cluster the K8s-native nccl-all-reduce-bw measures the wrong path — it launches a Pod past slurmd, so the bandwidth number reflects raw cluster networking, not what an `srun nccl-tests/all_reduce_perf` would see going through slurmd + the configured fabric (TCPXO on GKE, EFA on EKS). The argument generalizes to any Slinky cluster, so apply the same override on the EKS slurm leaf. Tighten the prose on both files: the old comment said "Conformance checks are inherited unchanged" but deployment checks are also inherited; replace with "Deployment and conformance checks are inherited unchanged" for precision.
TestNFDTopologyUpdater_OverlayCoverage enumerates every GPU overlay variant to catch overlays that accidentally lose NRT publishing on real-cluster leaves or accidentally turn it on for kind-chain leaves. The three slurm leaves (h100-eks-ubuntu-training-slurm, h100-gke-cos-training-slurm, h100-kind-training-slurm) were missing from the table — a pre-existing gap that this PR is the natural moment to close, since it adds one new slurm leaf and refactors the other two off the deleted platform-slurm mixins. Expected pattern matches the parent inheritance: EKS+GKE slurm leaves inherit topologyUpdater.enable=true from their h100-*-training parents; the kind slurm leaf stays off (no kubelet podResources socket on KWOK).
mchmarny
left a comment
There was a problem hiding this comment.
Clean refactor — moving slurm off the mixin merge path onto the same inline shape as dynamo-platform is the right call, the rationale in the description and YAML comments is thorough, and the live rehearsal (sinfo + srun -N2 nvidia-smi -L returning 16 H100 UUIDs) is solid evidence the GRES wiring actually works end-to-end. CI is green across all 142 checks.
Three findings, none blocking:
- Medium —
docs/integrator/recipe-development.mdline 118 overgeneralizes: claims "a platform's full component stack is declared inline … rather than via a platform mixin," butplatform-kubeflow/platform-inferenceare still active mixins and the example just above on the same page composesplatform-kubeflowviaspec.mixins. Scope the claim to slurm (and dynamo, if calling it out as precedent). - Nit — the GKE leaf's
K8s.server.version: ">= 1.32"constraint duplicates the parent. The EKS leaf's restatement is justified because it tightens to>= 1.32.4; here it's noise. - Nit — the ~20-line GRES rationale block is now duplicated verbatim across the EKS and GKE slurm leaves, and one of the references (
manifests/clusters/mars-prod/...) points at an internal NVIDIA repo that will read as opaque in the OSS tree.
Nothing here blocks merge — the medium finding is a doc accuracy issue, not a code defect.
- recipe-development.md: scope the "inline-per-leaf platform stack" claim to slurm and dynamo. The prior wording was too broad: platform-kubeflow and platform-inference are still active mixins (shown in the example two paragraphs up), and the dynamo-platform precedent already lives inline in *-inference-dynamo leaves. The new wording calls out slurm/dynamo as inline-per-leaf and explicitly preserves the mixin status of kubeflow/inference. - h100-gke-cos-training-slurm.yaml: drop the verbatim restatement of K8s.server.version (>= 1.32) from the parent. Same-name constraints get overridden during merge, so restating without strengthening is noise. The EKS slurm leaf restates >= 1.32.4 to tighten its parent; the GKE leaf has no tighter floor, so we inherit and add a one-line comment documenting why. - h100-eks-ubuntu-training-slurm.yaml and h100-gke-cos-training-slurm.yaml: drop the parenthetical pointer to mars-prod (manifests/clusters/mars-prod/...). That path lives in an internal NVIDIA repo and reads as opaque in the OSS tree. The rationale (Kubernetes auto-mirrors requests=limits for extended resources) stands on its own. Per Mark's review on NVIDIA#997.
|
@mchmarny could you please take a look? thanks! |
Summary
Adds the first-class
h100-gke-cos-training-slurmleaf and refactors the existing EKS/Kind slurm leaves to inline the three Slinky components per leaf (mirroring thedynamo-platformpattern), unblocking leaf-level GPU GRES tuning that the oldplatform-slurm/platform-slurm-clustermixin shape made impractical.Motivation / Context
The original
platform-slurm/platform-slurm-clustermixins contributedslinky-slurm-operator-crds,slinky-slurm-operator, andslinky-slurmto every slurm leaf. Because mixins are merged after the inheritance chain (pkg/recipe/metadata_store.go:mergeMixins) and the mixin's own componentRefs set identity fields (type,valuesFile,dependencyRefs), any leaf that pre-declared the same component name to attachoverrideswould causemixinComponentRefSafeForMergeto reject the mixin's identity-field entry as a conflict against the chain-existing component.In practice this meant the only way to push leaf-specific GPU GRES (
extraConfMap.Gres=gpu:h100:8andslurmd.resources.limits.nvidia.com/gpu: 8) intonodesets.slinkywas at install time via--set slinkyslurm:..., which made the leaf incomplete:srun --gres=gpu:NsawGres=(null)from any straightaicr bundle.Switching to the same inline pattern used by
dynamo-platformacross*-inference-dynamoleaves moves Slinky off the mixin merge path and onto the standard inheritance/overlay path, so each leaf carries its own hardware-specific tuning as part of the recipe itself.This change was developed and rehearsed end-to-end on the
aicr-eidosxGKE cluster: full 17-component bundle deploy on 2x H100 nodes,sinfoshowing both NodeSet pods idle withgpu:h100:8, andsrun -N2 --gres=gpu:8 nvidia-smi -Lreturning 16 unique H100 UUIDs across both hosts.Fixes: N/A
Related: NKX-9586, #948 (initial slinky-slurm chart + EKS/Kind leaves)
Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)cmd/aicrd,pkg/api,pkg/server)pkg/recipe)pkg/bundler,pkg/component/*)pkg/collector,pkg/snapshotter)pkg/validator)pkg/errors,pkg/k8s)docs/,examples/)recipes/overlays,recipes/mixins,recipes/registry.yaml)Implementation Notes
Why inline instead of relaxing the mixin guard? The
mixinComponentRefSafeForMergeguard exists for a reason (ADR-005's "no silent chart identity override" mitigation): mixin-contributed components must be replaceable wholesale, not partially merged, to keep the mixin contract simple. Thedynamo-platformfamily already established the precedent of declaring platform components inline per leaf — this PR brings slurm in line with that pattern rather than weakening the guard.Why omit
source/version/valuesFileon the slinkycomponentRefs? Unlikedynamo-platform, the Slinky charts are not declared inbase.yaml. They live only inregistry.yamlwithdefaultRepository/defaultVersion/defaultChart, andApplyRegistryDefaults(pkg/recipe/metadata.go:156-184) populates them at recipe-resolve time. The registry comment onslinky-slurmexplicitly documentsdefaultVersionas the single source of truth (one-line bumps instead of touching every leaf). This shape diverges from dynamo's inlining but is engine-supported and reduces leaf duplication.GRES wiring requires two fields, not one. The chart does not derive
Gres=inslurm.conffrom pod resource limits, so leaves must declare GRES in both places:nodesets.slinky.extraConfMap.Gres(tells slurmctld which GPUs slurmd has) andnodesets.slinky.slurmd.resources.limits.nvidia.com/gpu(reserves the GPUs from the device plugin sogres.conf'sAutoDetect=nvidiafinds them). Both are documented inline in the GKE leaf with a cross-reference to the mars-prod reference values.API reference docs gap. While updating component docs I noticed
slinky-slurm(the cluster instance) was missing fromdocs/user/api-reference.md's bundler table — a pre-existing gap from when it was opt-in via the deletedplatform-slurm-clustermixin. Added in the same refactor commit so the docs match the inline reality.Out of scope (logged separately): multi-node NCCL via Pyxis/Enroot requires swapping to
ghcr.io/slinkyproject/slurmd-pyxisand configuringplugstack.conf; not attempted here.Testing
End-to-end rehearsal on
aicr-eidosx(GKE, 2x H100):Risk Assessment
helm upgrade --installbut no manifest churn.Rollout notes: No data migration. The deleted mixins were always paired (no leaf ever used
platform-slurmwithoutplatform-slurm-cluster), so removing them does not regress any composition that was actually used. Any out-of-tree leaf still referencing the mixins will fail recipe validation with a clear error and needs a one-time port to the inline shape — see the EKS leaf for the template.Checklist
make testwith-race)make lint) — exceptlint-go(local Go-version build mismatch ongolangci-lint; CI runs pinned version)deployment_order_guard_test.go)recipe-development.md,component-catalog.md,api-reference.md)dynamo-platforminline pattern)git commit -S) — verify before pushing