feat(slinky-slurm): add conformance validator and CUJ demo#1394
feat(slinky-slurm): add conformance validator and CUJ demo#1394kaynetu wants to merge 3 commits into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis PR introduces a Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@demos/cuj1-slinky-slurm.md`:
- Line 11: The Markdown spans on line 11 and line 51 have malformed inline code
with unmatched bold markers. The pattern `**nodeGroup**` incorrectly places bold
delimiters inside backticks creating invalid Markdown. Fix this by either
removing the double asterisks and keeping only the backtick code formatting as
`nodeGroup`, or by repositioning the bold markers outside the backticks to
render as **`nodeGroup`** if bold emphasis is intended.
- Around line 154-158: Update the documented srun command timeout in the
conformance phase row of the table to match the actual implementation. Change
the timeout value from --time=0:01 to --time=0:03 in the conformance row to
align with what CheckSlinkySlurmHealth in
validators/conformance/slinky_slurm_health_check.go actually executes, ensuring
the documentation accurately reflects the validator's behavior.
In `@validators/conformance/slinky_slurm_health_check.go`:
- Around line 263-316: The controllerRef field extraction in the
listSlinkySetsForController function at lines 287-288 discards error and found
return values from NestedString calls, which could hide type mismatches or other
errors. Refactor the controllerName and controllerNamespace extraction to
capture all three return values from NestedString instead of using blanks, then
add explicit error checking similar to the selector extraction pattern at lines
295-299. Check if the error is not nil and return an appropriate error, and
optionally verify that the fields are found before proceeding with the
controller name comparison logic.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: d5da9b2a-04d2-4b28-ba2a-41fdfe830cba
⛔ Files ignored due to path filters (79)
go.sumis excluded by!**/*.sumvendor/github.com/gorilla/websocket/.gitignoreis excluded by!vendor/**vendor/github.com/gorilla/websocket/AUTHORSis excluded by!vendor/**vendor/github.com/gorilla/websocket/LICENSEis excluded by!vendor/**vendor/github.com/gorilla/websocket/README.mdis excluded by!vendor/**vendor/github.com/gorilla/websocket/client.gois excluded by!vendor/**vendor/github.com/gorilla/websocket/compression.gois excluded by!vendor/**vendor/github.com/gorilla/websocket/conn.gois excluded by!vendor/**vendor/github.com/gorilla/websocket/doc.gois excluded by!vendor/**vendor/github.com/gorilla/websocket/join.gois excluded by!vendor/**vendor/github.com/gorilla/websocket/json.gois excluded by!vendor/**vendor/github.com/gorilla/websocket/mask.gois excluded by!vendor/**vendor/github.com/gorilla/websocket/mask_safe.gois excluded by!vendor/**vendor/github.com/gorilla/websocket/prepared.gois excluded by!vendor/**vendor/github.com/gorilla/websocket/proxy.gois excluded by!vendor/**vendor/github.com/gorilla/websocket/server.gois excluded by!vendor/**vendor/github.com/gorilla/websocket/util.gois excluded by!vendor/**vendor/github.com/moby/spdystream/CONTRIBUTING.mdis excluded by!vendor/**vendor/github.com/moby/spdystream/LICENSEis excluded by!vendor/**vendor/github.com/moby/spdystream/MAINTAINERSis excluded by!vendor/**vendor/github.com/moby/spdystream/NOTICEis excluded by!vendor/**vendor/github.com/moby/spdystream/README.mdis excluded by!vendor/**vendor/github.com/moby/spdystream/connection.gois excluded by!vendor/**vendor/github.com/moby/spdystream/handlers.gois excluded by!vendor/**vendor/github.com/moby/spdystream/priority.gois excluded by!vendor/**vendor/github.com/moby/spdystream/spdy/LICENSEis excluded by!vendor/**vendor/github.com/moby/spdystream/spdy/PATENTSis excluded by!vendor/**vendor/github.com/moby/spdystream/spdy/dictionary.gois excluded by!vendor/**vendor/github.com/moby/spdystream/spdy/options.gois excluded by!vendor/**vendor/github.com/moby/spdystream/spdy/read.gois excluded by!vendor/**vendor/github.com/moby/spdystream/spdy/types.gois excluded by!vendor/**vendor/github.com/moby/spdystream/spdy/write.gois excluded by!vendor/**vendor/github.com/moby/spdystream/stream.gois excluded by!vendor/**vendor/github.com/moby/spdystream/utils.gois excluded by!vendor/**vendor/golang.org/x/net/internal/socks/client.gois excluded by!vendor/**vendor/golang.org/x/net/internal/socks/socks.gois excluded by!vendor/**vendor/golang.org/x/net/proxy/dial.gois excluded by!vendor/**vendor/golang.org/x/net/proxy/direct.gois excluded by!vendor/**vendor/golang.org/x/net/proxy/per_host.gois excluded by!vendor/**vendor/golang.org/x/net/proxy/proxy.gois excluded by!vendor/**vendor/golang.org/x/net/proxy/socks5.gois excluded by!vendor/**vendor/golang.org/x/net/websocket/client.gois excluded by!vendor/**vendor/golang.org/x/net/websocket/dial.gois excluded by!vendor/**vendor/golang.org/x/net/websocket/hybi.gois excluded by!vendor/**vendor/golang.org/x/net/websocket/server.gois excluded by!vendor/**vendor/golang.org/x/net/websocket/websocket.gois excluded by!vendor/**vendor/k8s.io/apimachinery/pkg/util/httpstream/doc.gois excluded by!vendor/**vendor/k8s.io/apimachinery/pkg/util/httpstream/httpstream.gois excluded by!vendor/**vendor/k8s.io/apimachinery/pkg/util/httpstream/spdy/doc.gois excluded by!vendor/**vendor/k8s.io/apimachinery/pkg/util/httpstream/spdy/spdy.gois excluded by!vendor/**vendor/k8s.io/apimachinery/pkg/util/remotecommand/constants.gois excluded by!vendor/**vendor/k8s.io/client-go/tools/remotecommand/OWNERSis excluded by!vendor/**vendor/k8s.io/client-go/tools/remotecommand/doc.gois excluded by!vendor/**vendor/k8s.io/client-go/tools/remotecommand/errorstream.gois excluded by!vendor/**vendor/k8s.io/client-go/tools/remotecommand/fallback.gois excluded by!vendor/**vendor/k8s.io/client-go/tools/remotecommand/reader.gois excluded by!vendor/**vendor/k8s.io/client-go/tools/remotecommand/remotecommand.gois excluded by!vendor/**vendor/k8s.io/client-go/tools/remotecommand/resize.gois excluded by!vendor/**vendor/k8s.io/client-go/tools/remotecommand/spdy.gois excluded by!vendor/**vendor/k8s.io/client-go/tools/remotecommand/v1.gois excluded by!vendor/**vendor/k8s.io/client-go/tools/remotecommand/v2.gois excluded by!vendor/**vendor/k8s.io/client-go/tools/remotecommand/v3.gois excluded by!vendor/**vendor/k8s.io/client-go/tools/remotecommand/v4.gois excluded by!vendor/**vendor/k8s.io/client-go/tools/remotecommand/v5.gois excluded by!vendor/**vendor/k8s.io/client-go/tools/remotecommand/websocket.gois excluded by!vendor/**vendor/k8s.io/client-go/transport/spdy/spdy.gois excluded by!vendor/**vendor/k8s.io/client-go/transport/websocket/roundtripper.gois excluded by!vendor/**vendor/k8s.io/client-go/util/exec/exec.gois excluded by!vendor/**vendor/k8s.io/streaming/LICENSEis excluded by!vendor/**vendor/k8s.io/streaming/pkg/httpstream/doc.gois excluded by!vendor/**vendor/k8s.io/streaming/pkg/httpstream/httpstream.gois excluded by!vendor/**vendor/k8s.io/streaming/pkg/httpstream/spdy/connection.gois excluded by!vendor/**vendor/k8s.io/streaming/pkg/httpstream/spdy/roundtripper.gois excluded by!vendor/**vendor/k8s.io/streaming/pkg/httpstream/spdy/upgrade.gois excluded by!vendor/**vendor/k8s.io/streaming/pkg/httpstream/wsstream/conn.gois excluded by!vendor/**vendor/k8s.io/streaming/pkg/httpstream/wsstream/doc.gois excluded by!vendor/**vendor/k8s.io/streaming/pkg/httpstream/wsstream/stream.gois excluded by!vendor/**vendor/k8s.io/streaming/pkg/runtime/runtime.gois excluded by!vendor/**vendor/modules.txtis excluded by!vendor/**
📒 Files selected for processing (17)
demos/README.mddemos/cuj1-slinky-slurm.mddocs/contributor/validator.mddocs/user/component-catalog.mdgo.modrecipes/checks/slinky-slurm/health-check.yamlrecipes/evidence/gb200-eks-ubuntu-training.yamlrecipes/evidence/h100-gke-cos-training.yamlrecipes/overlays/h100-eks-ubuntu-training-slurm.yamlrecipes/overlays/h100-gke-cos-training-slurm.yamlrecipes/overlays/h100-kind-training-slurm.yamlrecipes/validators/catalog.yamlvalidators/conformance/main.govalidators/conformance/pod_exec.govalidators/conformance/pod_exec_test.govalidators/conformance/slinky_slurm_health_check.govalidators/conformance/slinky_slurm_health_check_test.go
Recipe evidence checkAffected leaf overlays: 5
How to refresh evidenceRun on a cluster matching the recipe's aicr snapshot -o snapshot.yaml
aicr validate \
-r recipes/overlays/<slug>.yaml \
-s snapshot.yaml \
--emit-attestation ./out \
--push ghcr.io/<your-fork>/aicr-evidence
cp ./out/pointer.yaml recipes/evidence/<slug>.yamlThis gate is warning-only and never blocks merge. See ADR-007 for the trust model. |
mchmarny
left a comment
There was a problem hiding this comment.
Solid, well-tested addition. The check follows project patterns closely — pkg/errors with codes, fail-closed validators.Skip on missing Slinky API and KWOK-only nodes, test-injectable seams (newPodExecExecutor, slinkyExecCommand), and genuinely thorough table-driven tests (multi-failure aggregation, malformed controllerRef, custom CR selectors). Catalog 5m timeout and the 10m→7m assert trim under the 8m envelope are reasonable and well-documented in validator.md.
Two non-blocking comments inline: one medium robustness concern on first-container exec selection (pod_exec.go), one nit on a near-duplicate component-presence helper. Neither blocks merge.
CI: ClamAV + grype green; Tier-1 e2e still running at review time — worth a glance before merge.
| Namespace(namespace). | ||
| SubResource("exec"). | ||
| VersionedParams(&corev1.PodExecOptions{ | ||
| Container: pod.Spec.Containers[0].Name, |
There was a problem hiding this comment.
Containers[0].Name blindly picks the first container. Slinky login pods commonly carry sidecars (munge/sssd/logging) — and this PR's own pod_exec_test.go models a 2-container pod with a sidecar. If the slurm client container isn't index 0, exec lands in the wrong container and the check fails spuriously (scontrol/srun not found → nonzero exit).
It fails closed (false-negative, not a spurious pass), so not a blocker — but consider honoring the kubectl.kubernetes.io/default-container annotation (Slinky sets it on the login pod) or matching a known container name, with index-0 as fallback.
| return nil | ||
| } | ||
|
|
||
| func recipeHasEnabledComponent(ctx *validators.Context, name string) bool { |
There was a problem hiding this comment.
nit: this near-duplicates recipeHasComponent (robust_controller_check.go:56) — the only difference is the added IsEnabled() gate. Two almost-identical component-presence helpers in the same package will confuse the next reader. Consider consolidating into one; the IsEnabled() semantics here are arguably the more correct behavior for the other callers too.
ArangoGutierrez
left a comment
There was a problem hiding this comment.
Solid, well-tested validator — clean exec seam, exact-message assertions, KWOK skip, and selector-based pod discovery. go test ./validators/conformance/... is green locally.
One thing to settle before merge: the slurm overlays declare a validation.conformance.checks block, and the recipe merge replaces the inherited conformance phase rather than appending to it (pkg/recipe/metadata.go: s.Validation.Conformance = cloneValidationPhase(other.Validation.Conformance)). Net effect — each slurm leaf resolves to conformance = [slinky-slurm-health] only, dropping the 10 checks inherited from the eks/gke training parents (platform-health, gpu-operator-health, dra-support, secure-accelerator-access, …). That may be intended, but the overlay comments and the PR testing notes ("other inherited conformance checks failed/skipped on Kind — expected") read as if the checks are additive, which they aren't. Please confirm intent and either fix the comments or re-list the checks. Details plus two smaller notes (hardcoded namespace, first-container exec) are inline.
Non-blocking: catalog image is :latest (matches every existing entry, so out of scope here).
| # EFA libfabric stack already present on the parent EKS leaf. | ||
| # Deployment and conformance checks are inherited unchanged. | ||
| validation: | ||
| conformance: |
There was a problem hiding this comment.
This replaces the inherited conformance phase rather than adding to it — the merge swaps the whole phase pointer (s.Validation.Conformance = cloneValidationPhase(...) in pkg/recipe/metadata.go), so this leaf resolves to conformance = [slinky-slurm-health] only. The 10 checks from h100-eks-training (platform-health, gpu-operator-health, dra-support, secure-accelerator-access, …) are dropped.
Two points:
- The comment on L96 ("inherited unchanged") is no longer accurate for conformance.
- If dropping the K8s-native checks is intended (same rationale as clearing
performance.checks), please note in the PR that gpu-operator-health / dra-support no longer run on GPU slurm leaves. If they should stay, re-list them here — merge won't append.
Worth a case in pkg/recipe/conformance_test.go pinning the resolved set; the floor test only checks phase presence, so this drop is untested today.
| # gke-nccl-tcpxo. Deployment and conformance checks are inherited | ||
| # unchanged. | ||
| validation: | ||
| conformance: |
There was a problem hiding this comment.
Same as the eks overlay: this replaces the inherited conformance phase (10 checks) with just slinky-slurm-health, and the "inherited unchanged" comment above no longer holds. See the note on h100-eks-ubuntu-training-slurm.yaml.
|
|
||
| const ( | ||
| slinkySlurmComponent = "slinky-slurm" | ||
| slinkySlurmNamespace = "slurm" |
There was a problem hiding this comment.
Namespace is hardcoded, but the slinky-slurm component's namespace is a bundle-time-overridable default (registry defaultNamespace: slurm). If a user bundles into another namespace, the pod/CR lookups miss. The sibling platform-health check derives it from ctx.ValidationInput.ComponentRefs[].Namespace — doing the same here (fallback "slurm") would hold up under overrides.
| Namespace(namespace). | ||
| SubResource("exec"). | ||
| VersionedParams(&corev1.PodExecOptions{ | ||
| Container: pod.Spec.Containers[0].Name, |
There was a problem hiding this comment.
Targets the first container. Login pods can carry sidecars (munge/sssd/logging); if one is authored first, exec hits the wrong container — the test even models a {login, sidecar} pod and relies on login being index 0. Selecting the LoginSet's main container by name would be safer.
Summary
Adds the
slinky-slurm-healthconformance validator (pod-exec checks forscontrol ping, idle/mixed node inventory, and boundedsrun), wires it into slurm training leaves and the validator catalog, documents the EKS/GKE/Kind CUJ indemos/cuj1-slinky-slurm.md, and tunes the Slurm deployment Chainsaw assert timeout under theexpected-resourcesenvelope.Motivation / Context
Slinky Slurm leaves already deploy via
--platform slurm, but conformance had no workload-specific signal that slurmctld is reachable, workers are idle/mixed, and a minimal job can run through slurmd. This closes that gap and documents the full recipe → bundle → deploy → validate workflow (without snapshot) for EKS, GKE, and Kind.Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)cmd/aicrd,pkg/server)pkg/recipe)pkg/bundler,pkg/component/*)pkg/collector,pkg/snapshotter)pkg/validator,validators/conformance)pkg/errors,pkg/k8s)docs/,demos/)recipes/validators/catalog.yaml,recipes/checks/slinky-slurm/, slurm overlays,vendor/Implementation Notes
slinky-slurm-healthconformance check — discovers login/worker pods via Slinky LoginSet/NodeSet CR selectors, then execs from the login pod:scontrol ping,sinfo -h -Ne -t idle,mix | grep -q ., andsrun --immediate=5 --time=0:03 hostname. Shared pod-exec helper invalidators/conformance/pod_exec.go(pulls ink8s.io/client-go/tools/remotecommand+ vendored deps).slinky-slurm-healthadded to slurm leaves'validation.conformance.checks; inheritedperformance.checkscleared ([]) so K8s-nativenccl-all-reduce-bwdoes not bypass slurmd. Deployment checks (includingexpected-resources+ Chainsaw health check) remain inherited from parent training leaves.recipes/checks/slinky-slurm/health-check.yamlassertreduced from 10m → 7m to fit under the 8mexpected-resourcescatalog envelope; contributor note added indocs/contributor/validator.md.demos/cuj1-slinky-slurm.mdcovers Query Mode recipe generation, bundle scheduling overrides for EKS/GKE/Kind, validate phases, and manualsrunsmoke.Testing
unset GITLAB_TOKEN make qualifyUnit / CI:
make qualifypassed locally (tests with-race, lint, e2e, scan).Live cluster — conformance — built and pushed a custom
conformancevalidator image to the cluster registry, then pointedaicr validateat it:(Kind runs used the same pattern with Kind-appropriate
REG, recipe path, and tolerations/node-selector from the CUJ.)Other inherited conformance checks failed/skipped on Kind (no GPU/DRA) — expected.
Kind — healthy stack (
slinky-slurm-healthpassed):EKS — healthy worker (
slinky-slurm-healthpassed):EKS — unhealthy worker (
slinky-slurm-healthfailed as expected):Manual smoke:
srunhostname on the login pod (documented indemos/cuj1-slinky-slurm.md).Risk Assessment
slinky-slurmcomponent presence; slurm leaves only. Vendor bump is indirect deps for pod-exec. No change to default non-slurm recipes.Rollout notes: Requires rebuilding/pushing the
conformancevalidator image for live testing off main CI tags. Slurm leaves pick up the check via recipe merge; no migration.Checklist
make testwith-race)make lint)slinky_slurm_health_check_test.go,pod_exec_test.go)demos/cuj1-slinky-slurm.md,docs/user/component-catalog.md,docs/contributor/validator.md)git commit -S)