Skip to content

feat(slinky-slurm): add conformance validator and CUJ demo#1394

Open
kaynetu wants to merge 3 commits into
NVIDIA:mainfrom
kaynetu:feat/slinky-slurm-conformance-validator
Open

feat(slinky-slurm): add conformance validator and CUJ demo#1394
kaynetu wants to merge 3 commits into
NVIDIA:mainfrom
kaynetu:feat/slinky-slurm-conformance-validator

Conversation

@kaynetu

@kaynetu kaynetu commented Jun 17, 2026

Copy link
Copy Markdown

Summary

Adds the slinky-slurm-health conformance validator (pod-exec checks for scontrol ping, idle/mixed node inventory, and bounded srun), wires it into slurm training leaves and the validator catalog, documents the EKS/GKE/Kind CUJ in demos/cuj1-slinky-slurm.md, and tunes the Slurm deployment Chainsaw assert timeout under the expected-resources envelope.

Motivation / Context

Slinky Slurm leaves already deploy via --platform slurm, but conformance had no workload-specific signal that slurmctld is reachable, workers are idle/mixed, and a minimal job can run through slurmd. This closes that gap and documents the full recipe → bundle → deploy → validate workflow (without snapshot) for EKS, GKE, and Kind.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator, validators/conformance)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, demos/)
  • Other: recipes/validators/catalog.yaml, recipes/checks/slinky-slurm/, slurm overlays, vendor/

Implementation Notes

  • slinky-slurm-health conformance check — discovers login/worker pods via Slinky LoginSet/NodeSet CR selectors, then execs from the login pod: scontrol ping, sinfo -h -Ne -t idle,mix | grep -q ., and srun --immediate=5 --time=0:03 hostname. Shared pod-exec helper in validators/conformance/pod_exec.go (pulls in k8s.io/client-go/tools/remotecommand + vendored deps).
  • Recipe wiringslinky-slurm-health added to slurm leaves' validation.conformance.checks; inherited performance.checks cleared ([]) so K8s-native nccl-all-reduce-bw does not bypass slurmd. Deployment checks (including expected-resources + Chainsaw health check) remain inherited from parent training leaves.
  • Deployment Chainsaw timeoutrecipes/checks/slinky-slurm/health-check.yaml assert reduced from 10m → 7m to fit under the 8m expected-resources catalog envelope; contributor note added in docs/contributor/validator.md.
  • CUJdemos/cuj1-slinky-slurm.md covers Query Mode recipe generation, bundle scheduling overrides for EKS/GKE/Kind, validate phases, and manual srun smoke.

Testing

unset GITLAB_TOKEN
make qualify

Unit / CI: make qualify passed locally (tests with -race, lint, e2e, scan).

Live cluster — conformance — built and pushed a custom conformance validator image to the cluster registry, then pointed aicr validate at it:

COMMIT=$(git rev-parse HEAD)
REG=<cluster-registry>   # ECR for EKS; local/registry mirror for Kind as documented in CUJ

make image-validators IMAGE_REGISTRY=${REG} IMAGE_TAG=sha-${COMMIT}

AICR_VALIDATOR_IMAGE_REGISTRY=${REG} \
  AICR_VALIDATOR_IMAGE_TAG=sha-${COMMIT} \
  ./dist/aicr_darwin_arm64_v8.0/aicr validate \
    -r /tmp/aws-slurm-recipe.yaml \
    --toleration dedicated=worker-workload:NoSchedule \
    --toleration dedicated=worker-workload:NoExecute \
    --phase conformance -o /tmp/report4.json \
    --no-cleanup

(Kind runs used the same pattern with Kind-appropriate REG, recipe path, and tolerations/node-selector from the CUJ.)

Other inherited conformance checks failed/skipped on Kind (no GPU/DRA) — expected.

Kind — healthy stack (slinky-slurm-health passed):

slinky-slurm-worker-slinky-0   ready=1/1 phase=Running
...
stdout: Slurmctld(primary) at slinky-slurm-controller-0 is UP
stdout: PARTITION ... NODES STATE ... idle slinky-0
stdout: slinky-0          # srun hostname
status: passed

EKS — healthy worker (slinky-slurm-health passed):

slinky-slurm-worker-slinky-0   ready=1/1 phase=Running
scontrol ping  ExitCode: 0   → Slurmctld(primary) at slinky-slurm-controller-0 is UP
sinfo idle/mix ExitCode: 0
srun hostname  ExitCode: 0   → slinky-0
status: passed

EKS — unhealthy worker (slinky-slurm-health failed as expected):

slinky-slurm-worker-slinky-0   ready=0/1 phase=Running
scontrol ping  ExitCode: 0   → Slurmctld(primary) at slinky-slurm-controller-0 is UP
sinfo idle/mix ExitCode: 1
srun hostname  ExitCode: 1
stderr: srun: Unable to allocate resources: Requested nodes are busy
message: Slinky Slurm health commands failed: sinfo idle/mix: exit code 1; srun hostname: exit code 1
status: failed

Manual smoke: srun hostname on the login pod (documented in demos/cuj1-slinky-slurm.md).

Risk Assessment

  • Low — New conformance check is gated on slinky-slurm component presence; slurm leaves only. Vendor bump is indirect deps for pod-exec. No change to default non-slurm recipes.
  • Medium
  • High

Rollout notes: Requires rebuilding/pushing the conformance validator image for live testing off main CI tags. Slurm leaves pick up the check via recipe merge; no migration.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (slinky_slurm_health_check_test.go, pod_exec_test.go)
  • I updated docs if user-facing behavior changed (demos/cuj1-slinky-slurm.md, docs/user/component-catalog.md, docs/contributor/validator.md)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@kaynetu kaynetu requested review from a team as code owners June 17, 2026 17:35
@copy-pr-bot

copy-pr-bot Bot commented Jun 17, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

Copy link
Copy Markdown
Contributor

Welcome to AICR, @kaynetu! Thanks for your first pull request.

Before review, please ensure:

  • All commits are signed off per the DCO
  • CI checks pass (tests, lint, security scan)
  • The PR description explains the why behind your changes

A maintainer will review this soon.

@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c33a6029-7d28-4ab9-bf5a-7eabcb5e7fc2

📥 Commits

Reviewing files that changed from the base of the PR and between 9999034 and 380c582.

📒 Files selected for processing (3)
  • demos/cuj1-slinky-slurm.md
  • validators/conformance/slinky_slurm_health_check.go
  • validators/conformance/slinky_slurm_health_check_test.go

📝 Walkthrough

Walkthrough

This PR introduces a slinky-slurm-health conformance validator for Kubernetes Slurm clusters managed by the Slinky operator. It adds a pod_exec.go utility using SPDY remotecommand to exec commands inside pods, and slinky_slurm_health_check.go implementing CheckSlinkySlurmHealth which gates on component presence, discovers slinky.slurm.net/v1beta1 APIs, skips KWOK virtual clusters, selects a ready LoginSet pod, and runs scontrol ping, sinfo, and srun hostname checks. The new check is wired into the validator catalog, the conformance dispatch map, and three Slurm recipe overlays (EKS, GKE, Kind). A health-check timeout is reduced from 10m to 7m. Three indirect Go dependencies are added for SPDY transport. A 279-line CUJ1 demo walkthrough, a timeout-budgeting contributor doc section, component catalog cross-link, and Apache 2.0 license headers on two evidence files are also included.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

area/tests, area/docs, theme/validation

Suggested reviewers

  • mchmarny
  • njhensley
  • xdu31
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: adding a conformance validator and CUJ demo for slinky-slurm, which are the primary deliverables across all modified files.
Description check ✅ Passed The description comprehensively explains the changes, motivation, implementation, testing, and risk assessment, all directly related to the changeset.
Description check ✅ Passed Documentation files (demos/cuj1-slinky-slurm.md, docs/contributor/validator.md, docs/user/component-catalog.md) are updated appropriately to support the new functionality.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@demos/cuj1-slinky-slurm.md`:
- Line 11: The Markdown spans on line 11 and line 51 have malformed inline code
with unmatched bold markers. The pattern `**nodeGroup**` incorrectly places bold
delimiters inside backticks creating invalid Markdown. Fix this by either
removing the double asterisks and keeping only the backtick code formatting as
`nodeGroup`, or by repositioning the bold markers outside the backticks to
render as **`nodeGroup`** if bold emphasis is intended.
- Around line 154-158: Update the documented srun command timeout in the
conformance phase row of the table to match the actual implementation. Change
the timeout value from --time=0:01 to --time=0:03 in the conformance row to
align with what CheckSlinkySlurmHealth in
validators/conformance/slinky_slurm_health_check.go actually executes, ensuring
the documentation accurately reflects the validator's behavior.

In `@validators/conformance/slinky_slurm_health_check.go`:
- Around line 263-316: The controllerRef field extraction in the
listSlinkySetsForController function at lines 287-288 discards error and found
return values from NestedString calls, which could hide type mismatches or other
errors. Refactor the controllerName and controllerNamespace extraction to
capture all three return values from NestedString instead of using blanks, then
add explicit error checking similar to the selector extraction pattern at lines
295-299. Check if the error is not nil and return an appropriate error, and
optionally verify that the fields are found before proceeding with the
controller name comparison logic.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: d5da9b2a-04d2-4b28-ba2a-41fdfe830cba

📥 Commits

Reviewing files that changed from the base of the PR and between eea2a81 and 9999034.

⛔ Files ignored due to path filters (79)
  • go.sum is excluded by !**/*.sum
  • vendor/github.com/gorilla/websocket/.gitignore is excluded by !vendor/**
  • vendor/github.com/gorilla/websocket/AUTHORS is excluded by !vendor/**
  • vendor/github.com/gorilla/websocket/LICENSE is excluded by !vendor/**
  • vendor/github.com/gorilla/websocket/README.md is excluded by !vendor/**
  • vendor/github.com/gorilla/websocket/client.go is excluded by !vendor/**
  • vendor/github.com/gorilla/websocket/compression.go is excluded by !vendor/**
  • vendor/github.com/gorilla/websocket/conn.go is excluded by !vendor/**
  • vendor/github.com/gorilla/websocket/doc.go is excluded by !vendor/**
  • vendor/github.com/gorilla/websocket/join.go is excluded by !vendor/**
  • vendor/github.com/gorilla/websocket/json.go is excluded by !vendor/**
  • vendor/github.com/gorilla/websocket/mask.go is excluded by !vendor/**
  • vendor/github.com/gorilla/websocket/mask_safe.go is excluded by !vendor/**
  • vendor/github.com/gorilla/websocket/prepared.go is excluded by !vendor/**
  • vendor/github.com/gorilla/websocket/proxy.go is excluded by !vendor/**
  • vendor/github.com/gorilla/websocket/server.go is excluded by !vendor/**
  • vendor/github.com/gorilla/websocket/util.go is excluded by !vendor/**
  • vendor/github.com/moby/spdystream/CONTRIBUTING.md is excluded by !vendor/**
  • vendor/github.com/moby/spdystream/LICENSE is excluded by !vendor/**
  • vendor/github.com/moby/spdystream/MAINTAINERS is excluded by !vendor/**
  • vendor/github.com/moby/spdystream/NOTICE is excluded by !vendor/**
  • vendor/github.com/moby/spdystream/README.md is excluded by !vendor/**
  • vendor/github.com/moby/spdystream/connection.go is excluded by !vendor/**
  • vendor/github.com/moby/spdystream/handlers.go is excluded by !vendor/**
  • vendor/github.com/moby/spdystream/priority.go is excluded by !vendor/**
  • vendor/github.com/moby/spdystream/spdy/LICENSE is excluded by !vendor/**
  • vendor/github.com/moby/spdystream/spdy/PATENTS is excluded by !vendor/**
  • vendor/github.com/moby/spdystream/spdy/dictionary.go is excluded by !vendor/**
  • vendor/github.com/moby/spdystream/spdy/options.go is excluded by !vendor/**
  • vendor/github.com/moby/spdystream/spdy/read.go is excluded by !vendor/**
  • vendor/github.com/moby/spdystream/spdy/types.go is excluded by !vendor/**
  • vendor/github.com/moby/spdystream/spdy/write.go is excluded by !vendor/**
  • vendor/github.com/moby/spdystream/stream.go is excluded by !vendor/**
  • vendor/github.com/moby/spdystream/utils.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socks/client.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socks/socks.go is excluded by !vendor/**
  • vendor/golang.org/x/net/proxy/dial.go is excluded by !vendor/**
  • vendor/golang.org/x/net/proxy/direct.go is excluded by !vendor/**
  • vendor/golang.org/x/net/proxy/per_host.go is excluded by !vendor/**
  • vendor/golang.org/x/net/proxy/proxy.go is excluded by !vendor/**
  • vendor/golang.org/x/net/proxy/socks5.go is excluded by !vendor/**
  • vendor/golang.org/x/net/websocket/client.go is excluded by !vendor/**
  • vendor/golang.org/x/net/websocket/dial.go is excluded by !vendor/**
  • vendor/golang.org/x/net/websocket/hybi.go is excluded by !vendor/**
  • vendor/golang.org/x/net/websocket/server.go is excluded by !vendor/**
  • vendor/golang.org/x/net/websocket/websocket.go is excluded by !vendor/**
  • vendor/k8s.io/apimachinery/pkg/util/httpstream/doc.go is excluded by !vendor/**
  • vendor/k8s.io/apimachinery/pkg/util/httpstream/httpstream.go is excluded by !vendor/**
  • vendor/k8s.io/apimachinery/pkg/util/httpstream/spdy/doc.go is excluded by !vendor/**
  • vendor/k8s.io/apimachinery/pkg/util/httpstream/spdy/spdy.go is excluded by !vendor/**
  • vendor/k8s.io/apimachinery/pkg/util/remotecommand/constants.go is excluded by !vendor/**
  • vendor/k8s.io/client-go/tools/remotecommand/OWNERS is excluded by !vendor/**
  • vendor/k8s.io/client-go/tools/remotecommand/doc.go is excluded by !vendor/**
  • vendor/k8s.io/client-go/tools/remotecommand/errorstream.go is excluded by !vendor/**
  • vendor/k8s.io/client-go/tools/remotecommand/fallback.go is excluded by !vendor/**
  • vendor/k8s.io/client-go/tools/remotecommand/reader.go is excluded by !vendor/**
  • vendor/k8s.io/client-go/tools/remotecommand/remotecommand.go is excluded by !vendor/**
  • vendor/k8s.io/client-go/tools/remotecommand/resize.go is excluded by !vendor/**
  • vendor/k8s.io/client-go/tools/remotecommand/spdy.go is excluded by !vendor/**
  • vendor/k8s.io/client-go/tools/remotecommand/v1.go is excluded by !vendor/**
  • vendor/k8s.io/client-go/tools/remotecommand/v2.go is excluded by !vendor/**
  • vendor/k8s.io/client-go/tools/remotecommand/v3.go is excluded by !vendor/**
  • vendor/k8s.io/client-go/tools/remotecommand/v4.go is excluded by !vendor/**
  • vendor/k8s.io/client-go/tools/remotecommand/v5.go is excluded by !vendor/**
  • vendor/k8s.io/client-go/tools/remotecommand/websocket.go is excluded by !vendor/**
  • vendor/k8s.io/client-go/transport/spdy/spdy.go is excluded by !vendor/**
  • vendor/k8s.io/client-go/transport/websocket/roundtripper.go is excluded by !vendor/**
  • vendor/k8s.io/client-go/util/exec/exec.go is excluded by !vendor/**
  • vendor/k8s.io/streaming/LICENSE is excluded by !vendor/**
  • vendor/k8s.io/streaming/pkg/httpstream/doc.go is excluded by !vendor/**
  • vendor/k8s.io/streaming/pkg/httpstream/httpstream.go is excluded by !vendor/**
  • vendor/k8s.io/streaming/pkg/httpstream/spdy/connection.go is excluded by !vendor/**
  • vendor/k8s.io/streaming/pkg/httpstream/spdy/roundtripper.go is excluded by !vendor/**
  • vendor/k8s.io/streaming/pkg/httpstream/spdy/upgrade.go is excluded by !vendor/**
  • vendor/k8s.io/streaming/pkg/httpstream/wsstream/conn.go is excluded by !vendor/**
  • vendor/k8s.io/streaming/pkg/httpstream/wsstream/doc.go is excluded by !vendor/**
  • vendor/k8s.io/streaming/pkg/httpstream/wsstream/stream.go is excluded by !vendor/**
  • vendor/k8s.io/streaming/pkg/runtime/runtime.go is excluded by !vendor/**
  • vendor/modules.txt is excluded by !vendor/**
📒 Files selected for processing (17)
  • demos/README.md
  • demos/cuj1-slinky-slurm.md
  • docs/contributor/validator.md
  • docs/user/component-catalog.md
  • go.mod
  • recipes/checks/slinky-slurm/health-check.yaml
  • recipes/evidence/gb200-eks-ubuntu-training.yaml
  • recipes/evidence/h100-gke-cos-training.yaml
  • recipes/overlays/h100-eks-ubuntu-training-slurm.yaml
  • recipes/overlays/h100-gke-cos-training-slurm.yaml
  • recipes/overlays/h100-kind-training-slurm.yaml
  • recipes/validators/catalog.yaml
  • validators/conformance/main.go
  • validators/conformance/pod_exec.go
  • validators/conformance/pod_exec_test.go
  • validators/conformance/slinky_slurm_health_check.go
  • validators/conformance/slinky_slurm_health_check_test.go

Comment thread demos/cuj1-slinky-slurm.md Outdated
Comment thread demos/cuj1-slinky-slurm.md
Comment thread validators/conformance/slinky_slurm_health_check.go
@github-actions

Copy link
Copy Markdown
Contributor

Recipe evidence check

Affected leaf overlays: 5

Recipe Pointer Verify Digest match
gb200-eks-ubuntu-training ✅ present ✅ passed ⚠️ stale (9aeea19f5b75… vs current 07a8ff1e9d4b…)
h100-eks-ubuntu-training-slurm ⚠️ missing
h100-gke-cos-training-slurm ⚠️ missing
h100-gke-cos-training ✅ present ✅ passed ⚠️ stale (b82cade51fd2… vs current 064f27fa945d…)
h100-kind-training-slurm ⚠️ missing

How to refresh evidence

Run on a cluster matching the recipe's criteria:

aicr snapshot -o snapshot.yaml
aicr validate \
  -r recipes/overlays/<slug>.yaml \
  -s snapshot.yaml \
  --emit-attestation ./out \
  --push ghcr.io/<your-fork>/aicr-evidence
cp ./out/pointer.yaml recipes/evidence/<slug>.yaml

This gate is warning-only and never blocks merge. See ADR-007 for the trust model.

@github-actions

Copy link
Copy Markdown
Contributor

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid, well-tested addition. The check follows project patterns closely — pkg/errors with codes, fail-closed validators.Skip on missing Slinky API and KWOK-only nodes, test-injectable seams (newPodExecExecutor, slinkyExecCommand), and genuinely thorough table-driven tests (multi-failure aggregation, malformed controllerRef, custom CR selectors). Catalog 5m timeout and the 10m→7m assert trim under the 8m envelope are reasonable and well-documented in validator.md.

Two non-blocking comments inline: one medium robustness concern on first-container exec selection (pod_exec.go), one nit on a near-duplicate component-presence helper. Neither blocks merge.

CI: ClamAV + grype green; Tier-1 e2e still running at review time — worth a glance before merge.

Namespace(namespace).
SubResource("exec").
VersionedParams(&corev1.PodExecOptions{
Container: pod.Spec.Containers[0].Name,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Containers[0].Name blindly picks the first container. Slinky login pods commonly carry sidecars (munge/sssd/logging) — and this PR's own pod_exec_test.go models a 2-container pod with a sidecar. If the slurm client container isn't index 0, exec lands in the wrong container and the check fails spuriously (scontrol/srun not found → nonzero exit).

It fails closed (false-negative, not a spurious pass), so not a blocker — but consider honoring the kubectl.kubernetes.io/default-container annotation (Slinky sets it on the login pod) or matching a known container name, with index-0 as fallback.

return nil
}

func recipeHasEnabledComponent(ctx *validators.Context, name string) bool {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this near-duplicates recipeHasComponent (robust_controller_check.go:56) — the only difference is the added IsEnabled() gate. Two almost-identical component-presence helpers in the same package will confuse the next reader. Consider consolidating into one; the IsEnabled() semantics here are arguably the more correct behavior for the other callers too.

@ArangoGutierrez ArangoGutierrez left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid, well-tested validator — clean exec seam, exact-message assertions, KWOK skip, and selector-based pod discovery. go test ./validators/conformance/... is green locally.

One thing to settle before merge: the slurm overlays declare a validation.conformance.checks block, and the recipe merge replaces the inherited conformance phase rather than appending to it (pkg/recipe/metadata.go: s.Validation.Conformance = cloneValidationPhase(other.Validation.Conformance)). Net effect — each slurm leaf resolves to conformance = [slinky-slurm-health] only, dropping the 10 checks inherited from the eks/gke training parents (platform-health, gpu-operator-health, dra-support, secure-accelerator-access, …). That may be intended, but the overlay comments and the PR testing notes ("other inherited conformance checks failed/skipped on Kind — expected") read as if the checks are additive, which they aren't. Please confirm intent and either fix the comments or re-list the checks. Details plus two smaller notes (hardcoded namespace, first-container exec) are inline.

Non-blocking: catalog image is :latest (matches every existing entry, so out of scope here).

# EFA libfabric stack already present on the parent EKS leaf.
# Deployment and conformance checks are inherited unchanged.
validation:
conformance:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This replaces the inherited conformance phase rather than adding to it — the merge swaps the whole phase pointer (s.Validation.Conformance = cloneValidationPhase(...) in pkg/recipe/metadata.go), so this leaf resolves to conformance = [slinky-slurm-health] only. The 10 checks from h100-eks-training (platform-health, gpu-operator-health, dra-support, secure-accelerator-access, …) are dropped.

Two points:

  • The comment on L96 ("inherited unchanged") is no longer accurate for conformance.
  • If dropping the K8s-native checks is intended (same rationale as clearing performance.checks), please note in the PR that gpu-operator-health / dra-support no longer run on GPU slurm leaves. If they should stay, re-list them here — merge won't append.

Worth a case in pkg/recipe/conformance_test.go pinning the resolved set; the floor test only checks phase presence, so this drop is untested today.

# gke-nccl-tcpxo. Deployment and conformance checks are inherited
# unchanged.
validation:
conformance:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the eks overlay: this replaces the inherited conformance phase (10 checks) with just slinky-slurm-health, and the "inherited unchanged" comment above no longer holds. See the note on h100-eks-ubuntu-training-slurm.yaml.


const (
slinkySlurmComponent = "slinky-slurm"
slinkySlurmNamespace = "slurm"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Namespace is hardcoded, but the slinky-slurm component's namespace is a bundle-time-overridable default (registry defaultNamespace: slurm). If a user bundles into another namespace, the pod/CR lookups miss. The sibling platform-health check derives it from ctx.ValidationInput.ComponentRefs[].Namespace — doing the same here (fallback "slurm") would hold up under overrides.

Namespace(namespace).
SubResource("exec").
VersionedParams(&corev1.PodExecOptions{
Container: pod.Spec.Containers[0].Name,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Targets the first container. Login pods can carry sidecars (munge/sssd/logging); if one is authored first, exec hits the wrong container — the test even models a {login, sidecar} pod and relies on login being index 0. Selecting the LoginSet's main container by name would be safer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants