NVIDIA · mchmarny · Jun 22, 2026 · Jun 16, 2026 · Jun 17, 2026 · Jun 22, 2026
@@ -7,6 +7,7 @@ Runbooks for testing and demonstrating AICR end-to-end workflows on live cluster
 | Demo | Description |
 |------|-------------|
 | [cuj1-training.md](cuj1-training.md) | CUJ1 (training) - EKS + GKE end-to-end, plus a config-driven GKE + signed-evidence variant |
+| [cuj1-slinky-slurm.md](cuj1-slinky-slurm.md) | CUJ1 - Slinky Slurm on EKS / GKE / Kind (recipe → bundle → validate → `srun`) |
 | [cuj2-inference.md](cuj2-inference.md) | CUJ2 (inference) - EKS + GKE end-to-end with the Dynamo platform |
 | [cuj2-demo.md](cuj2-demo.md) | CUJ2 (inference) - Annotated slide-style demo walkthrough (training vs inference) |
 | [recipe-data-architecture.md](recipe-data-architecture.md) | Recipe metadata system: inheritance, criteria matching, deployment order, runtime external data |

@@ -0,0 +1,279 @@
+# AICR - Critical User Journey (CUJ) 1 — Slinky Slurm
+
+End-to-end walkthrough: **generate recipe (Query Mode) → bundle → deploy → validate → `srun` smoke job**.
+
+Slurm leaves are built from criteria flags (`--service`, `--platform slurm`, …), not from `aicr snapshot` — snapshot intake for Slurm is not supported today. See [Query Mode](../docs/user/cli-reference.md#aicr-recipe) in the CLI reference.
+
+## Assumptions
+
+- `kubectl` is configured for the target cluster.
+- GPU leaves assume H100 nodes with drivers (or Kind for the CPU-only path).
+- Node pools use a `nodeGroup` label (adjust if your cluster uses different keys).
+- Inspect taints before bundling: `kubectl get nodes -o custom-columns=NAME:.metadata.name,GROUP:.metadata.labels.nodeGroup,TAINTS:.spec.taints`
+
+## Workflow
+
+```text
+  aicr recipe          aicr bundle        ./deploy.sh       aicr validate       srun smoke
+  (Query Mode)    ──▶  (scheduling)  ──▶  (install)    ──▶  (phases)       ──▶  (manual)
+```
+
+1. **Generate recipe (Query Mode)** — `aicr recipe --service … --platform slurm` resolves a slurm leaf overlay to `recipe.yaml`.
+2. **Generate bundle** — apply `--system-*` / `--accelerated-*` scheduling and optional `--set` / `--set-json` on `slinkyslurm`.
+3. **Install** — run `deploy.sh`; cert-manager and Slinky operator come up, then the cluster chart in `slurm`.
+4. **Validate** — run `deployment` (Chainsaw component health) and `conformance` (`slinky-slurm-health` from the login pod). **Performance validation is not supported yet** on slurm leaves.
+5. **Smoke job** — `kubectl exec` into the login pod and run `srun` to confirm scheduling.
+
+## Generate Recipe (Query Mode)
+
+Pick the row that matches your cluster. Each resolves to a slurm leaf with three inline Slinky components: `slinky-slurm-operator-crds`, `slinky-slurm-operator`, and `slinky-slurm`.
+
+
+| Cloud    | Command                                                                                                      | Leaf overlay                                               |
+| -------- | ------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------- |
+| **EKS**  | `aicr recipe --service eks --accelerator h100 --intent training --os ubuntu --platform slurm -o recipe.yaml` | `h100-eks-ubuntu-training-slurm`                           |
+| **GKE**  | `aicr recipe --service gke --accelerator h100 --intent training --os cos --platform slurm -o recipe.yaml`    | `h100-gke-cos-training-slurm`                              |
+| **Kind** | `aicr recipe --service kind --accelerator h100 --intent training --platform slurm -o recipe.yaml`            | `h100-kind-training-slurm` (CPU-only NodeSet; no GPU GRES) |
+
+
+H100 cloud leaves bake in `Gres=gpu:h100:8` and matching `nvidia.com/gpu: 8` slurmd limits so `srun --gres=gpu:N` works after deploy.
+
+## Generate Bundle
+
+### Scheduling model
+
+AICR injects placement from bundle flags using each component's registry paths:
+
+
+| Flag                                                            | Typical targets                                     |
+| --------------------------------------------------------------- | --------------------------------------------------- |
+| `--system-node-selector` / `--system-node-toleration`           | cert-manager, **slurm-operator**, prometheus, …     |
+| `--accelerated-node-selector` / `--accelerated-node-toleration` | `nodesets.slinky` (slurmd workers)                  |
+| `--set-json slinkyslurm:…`                                      | Per-leaf overrides on the cluster chart (see below) |
+
+
+**Registry default for `slinky-slurm`:** `controller`, `restapi`, and `loginsets.slinky` use the **system** paths; `nodesets.slinky` uses **accelerated** paths. On split clusters (system pool + GPU pool), override the control plane onto the pool you want with `--set-json` (runs **after** selector injection and wins on those paths).
+
+**Operator note:** slurm-operator chart v1.1.0 ignores `nodeSelector`; it schedules from **tolerations** only. On EKS, include **both** `NoSchedule` and `NoExecute` for each taint key — nodes often carry both effects.
+
+**Override aliases:** `slinkyslurm`, `slurmcluster` (cluster chart); `slurm`, `slurmoperator` (operator chart). See `valueOverrideKeys` in `recipes/registry.yaml`.
+
+**Scalar vs structured overrides:**
+
+- `--set slinkyslurm:nodesets.slinky.replicas=2` — replicas, simple scalars.
+- `--set-json slinkyslurm:controller.podSpec=…` — full `nodeSelector` / `tolerations` objects (required when overriding system-injected scheduling on control-plane paths).
+
+### EKS (dual taints: `system-workload` / `worker-workload`)
+
+Example layout: 3× `system-worker`, 1× `gpu-worker`. Operator + platform stack on system nodes; slurmd on GPU; controller / login / restapi pinned to GPU via `--set-json`.
+
+```shell
+WORKER_TOLS='[{"key":"dedicated","operator":"Equal","value":"worker-workload","effect":"NoSchedule"},{"key":"dedicated","operator":"Equal","value":"worker-workload","effect":"NoExecute"}]'
+
+aicr bundle \
+  --recipe recipe.yaml \
+  --deployer helm \
+  --system-node-selector nodeGroup=system-worker \
+  --system-node-toleration dedicated=system-workload:NoSchedule \
+  --system-node-toleration dedicated=system-workload:NoExecute \
+  --accelerated-node-selector nodeGroup=gpu-worker \
+  --accelerated-node-toleration dedicated=worker-workload:NoSchedule \
+  --accelerated-node-toleration dedicated=worker-workload:NoExecute \
+  --storage-class <storage-class> \
+  --set slinkyslurm:nodesets.slinky.replicas=1 \
+  --set-json "slinkyslurm:controller.podSpec={\"nodeSelector\":{\"nodeGroup\":\"gpu-worker\"},\"tolerations\":${WORKER_TOLS}}" \
+  --set-json "slinkyslurm:restapi.podSpec={\"nodeSelector\":{\"nodeGroup\":\"gpu-worker\"},\"tolerations\":${WORKER_TOLS}}" \
+  --set-json "slinkyslurm:loginsets.slinky.podSpec={\"nodeSelector\":{\"nodeGroup\":\"gpu-worker\"},\"tolerations\":${WORKER_TOLS}}" \
+  --output bundle
+```
+
+Set `replicas` to your GPU node count when you have multiple workers.
+
+### GKE (system + cpu + gpu pools; GPU taint only)
+
+Example layout: 3× `system-worker` (no taints), 1× `cpu-worker` (no taints), 2× `gpu-worker` (`dedicated=gpu-workload:NoSchedule`). Control plane on **cpu-worker**; slurmd on **gpu-worker**.
+
+```shell
+aicr bundle \
+  --recipe recipe.yaml \
+  --deployer helm \
+  --system-node-selector nodeGroup=system-worker \
+  --accelerated-node-selector nodeGroup=gpu-worker \
+  --accelerated-node-toleration dedicated=gpu-workload:NoSchedule \
+  --storage-class <storage-class> \
+  --set slinkyslurm:nodesets.slinky.replicas=2 \
+  --set-json 'slinkyslurm:controller.podSpec={"nodeSelector":{"nodeGroup":"cpu-worker"}}' \
+  --set-json 'slinkyslurm:restapi.podSpec={"nodeSelector":{"nodeGroup":"cpu-worker"}}' \
+  --set-json 'slinkyslurm:loginsets.slinky.podSpec={"nodeSelector":{"nodeGroup":"cpu-worker"}}' \
+  --output bundle
+```
+
+GKE system nodes should **not** carry custom taints (konnectivity and other managed pods break). No `--system-node-toleration` on GKE when system/cpu pools are untainted.
+
+Optional: `--accelerated-node-toleration nvidia.com/gpu=present:NoSchedule` (harmless if that taint is absent).
+
+### Kind (CPU-only smoke / CI)
+
+No GPU pools or taints; omit accelerated flags unless your Kind config adds them.
+
+```shell
+aicr bundle \
+  --recipe recipe.yaml \
+  --deployer helm \
+  --output bundle
+```
+
+For automated no-GPU checks, see `make kwok-e2e` / `make check-health COMPONENT=slinky-slurm` in the repo Makefile.
+
+### Storage class
+
+Set `--storage-class` to a StorageClass that exists (`kubectl get storageclass`). The kube-prometheus-stack overlay uses a `volumeClaimTemplate` without a default `storageClassName`; a missing/default SC leaves PVCs Pending.
+
+## Install Bundle
+
+```shell
+cd ./bundle && chmod +x deploy.sh && ./deploy.sh
+```
+
+Deploy order: `cert-manager` → `slinky-slurm-operator-crds` → `slinky-slurm-operator` → `slinky-slurm`.
+
+```shell
+kubectl rollout status -n slinky deploy/slurm-operator
+kubectl get pods -n slurm
+kubectl wait --for=jsonpath='{.status.conditions[?(@.type=="Available")].status}'=True \
+  -n slurm deploy/slinky-slurm-login-slinky --timeout=10m
+```
+
+If nodewright is already installed, skip those sections in `deploy.sh` to avoid upgrade conflicts.
+
+## Validate Cluster
+
+Use **deployment** and **conformance**. Performance validation is **not supported yet** on slurm leaves — there is no Slurm-native NCCL (or equivalent) check in AICR today; a K8s Pod benchmark would bypass slurmd and is the wrong path on a Slinky-managed cluster.
+
+
+| Phase         | What it checks                                                                                                         |
+| ------------- | ---------------------------------------------------------------------------------------------------------------------- |
+| `deployment`  | Component Chainsaw health (CRs, Deployments, DaemonSets ready), including `slinky-slurm` readiness (long retry budget) |
+| `conformance` | `slinky-slurm-health`: `scontrol ping`, idle/mix node gate, bounded `srun --immediate=5 --time=0:03 hostname`          |
+| `performance` | **Not supported yet** on slurm leaves                                                                                  |
+| `all`         | Runs deployment → conformance → performance in sequence; the performance step has nothing to run on slurm leaves       |
+
+
+### All phases
+
+```shell
+aicr validate \
+  --recipe recipe.yaml \
+  --phase all \
+  --output report.json
+```
+
+Prefer `--phase deployment --phase conformance` when you only want the supported checks.
+
+### Specific phases
+
+```shell
+# After deploy.sh — component + CR readiness (Chainsaw)
+aicr validate \
+  --recipe recipe.yaml \
+  --phase deployment \
+  --output report-deployment.json
+
+# Slurm behavior from login pod (conformance Job)
+aicr validate \
+  --recipe recipe.yaml \
+  --phase conformance \
+  --output report-conformance.json
+
+# Both — common after install
+aicr validate \
+  --recipe recipe.yaml \
+  --phase deployment \
+  --phase conformance \
+  --output report.json
+```
+
+### Scheduling flags on validate
+
+When validate captures cluster state inline (no `-s`), pass `--node-selector` and `--toleration` so the snapshot agent Job can schedule on tainted nodes. Match your **system** pool (not the GPU pool) unless you intend to run the agent on GPU nodes.
+
+**EKS example** (agent on system nodes):
+
+```shell
+aicr validate \
+  --recipe recipe.yaml \
+  --node-selector nodeGroup=system-worker \
+  --toleration dedicated=system-workload:NoSchedule \
+  --toleration dedicated=system-workload:NoExecute \
+  --phase deployment \
+  --phase conformance \
+  --output report.json
+```
+
+**GKE example** (untainted system pool; `--toleration` optional):
+
+```shell
+aicr validate \
+  --recipe recipe.yaml \
+  --node-selector nodeGroup=system-worker \
+  --toleration dedicated=gpu-workload:NoSchedule \
+  --phase deployment \
+  --phase conformance \
+  --output report.json
+```
+
+`--toleration` on validate applies to inner conformance/deployment Jobs; pair it with `--node-selector` when the default GPU auto-selector (`nvidia.com/gpu.present=true`) would land on tainted nodes you cannot tolerate.
+
+Readiness constraints (K8s version, OS, …) still run before any phase; they use measurements from the inline capture path above.
+
+## Run Job
+
+SSH is disabled by default on the login chart; use `kubectl exec`.
+
+```shell
+kubectl exec -n slurm deploy/slinky-slurm-login-slinky -- sinfo
+kubectl exec -n slurm deploy/slinky-slurm-login-slinky -- \
+  srun --immediate=5 --time=0:03 hostname
+```
+
+Multi-node (when `replicas >= 2`):
+
+```shell
+kubectl exec -n slurm deploy/slinky-slurm-login-slinky -- srun -N2 hostname
+```
+
+GPU GRES smoke (H100 cloud leaves):
+
+```shell
+kubectl exec -n slurm deploy/slinky-slurm-login-slinky -- \
+  sh -c 'srun -N2 --gres=gpu:8 nvidia-smi -L | sort -u | wc -l'
+```
+
+## Cleanup
+
+Cluster instance only (keep operator + CRDs):
+
+```shell
+helm uninstall slinky-slurm -n slurm
+```
+
+Full Slurm stack:
+
+```shell
+helm uninstall slinky-slurm -n slurm
+helm uninstall slinky-slurm-operator -n slinky
+helm uninstall slinky-slurm-operator-crds -n slinky
+kubectl delete ns slurm slinky --ignore-not-found
+```
+
+Helm does not remove CRDs or PVCs by default; delete manually when you need a clean re-install.
+
+## Success
+
+- `deployment` + `conformance` phases pass in the CTRF report.
+- `sinfo` shows NodeSet nodes idle.
+- `srun hostname` returns worker hostnames.
+- On GPU leaves, `srun --gres=gpu:8 nvidia-smi -L` reaches all GPUs per node.
+
+> Multi-node NCCL via `srun` + Pyxis/Enroot is the natural Slurm-native performance path; it is out of scope for this smoke CUJ and not covered by `aicr validate --phase performance` today.
+
@@ -734,6 +734,28 @@ make check-health-all                      # everything in recipes/checks/
 make validate-local RECIPE=recipe.yaml     # full pipeline in Kind
 ```
 
+### Timeout budgeting
+
+During `aicr validate --phase deployment`, registry health checks in
+`recipes/checks/<component>/health-check.yaml` run in-process inside
+the `expected-resources` check (`validators/chainsaw/inprocess.go`).
+
+A Test's `spec.timeouts.assert` is the **whole-Test budget** — one
+deadline shared across every step and retry. Slurm's
+[`health-check.yaml`](https://github.com/NVIDIA/aicr/blob/main/recipes/checks/slinky-slurm/health-check.yaml)
+uses `assert: 7m` so workload-readiness steps can converge before the
+pod-phase guard runs.
+
+The `expected-resources` catalog timeout (8m in
+`recipes/validators/catalog.yaml`) is the **outer** envelope. It must
+exceed the longest in-tree `assert` value plus headroom for
+pre-chainsaw work, chainsaw teardown, and log flush
+(`defaults.JobEnvelopeMargin`). If assert runs too close to that
+catalog deadline, the Job can SIGKILL the pod before chainsaw reports
+the failing step — operators see truncated output instead of a useful
+failure. Raise the catalog `timeout` in tandem when you need a longer
+assert budget (`TestExpectedResourcesCatalogEnvelope` guards this).
+
 ## Constraint evaluation algorithm
 
 `pkg/constraints` is shared by surface 1, surface 2's recipe

@@ -47,7 +47,7 @@ Not every component appears in every recipe. The recipe engine selects component
 - **Base components** (cert-manager, kube-prometheus-stack) appear in most recipes.
 - **Cloud-specific components** (aws-efa, aws-ebs-csi-driver) are added when the service matches.
 - **Intent-specific components** (agentgateway, agentgateway-crds) are added based on workload intent (e.g., inference recipes include the inference gateway).
-- **Platform-specific components** (slinky-slurm-operator, slinky-slurm, kubeflow-trainer, dynamo-platform) are added when the recipe selects a matching `--platform`. For `--platform slurm`, all three Slinky pieces (`slinky-slurm-operator-crds`, `slinky-slurm-operator`, `slinky-slurm`) are declared inline per slurm leaf overlay — the same shape `dynamo-platform` uses across `*-inference-dynamo` leaves. Leaves that want the operator only inline the CRDs + operator and omit the `slinky-slurm` componentRef.
+- **Platform-specific components** (slinky-slurm-operator, slinky-slurm, kubeflow-trainer, dynamo-platform) are added when the recipe selects a matching `--platform`. For `--platform slurm`, all three Slinky pieces (`slinky-slurm-operator-crds`, `slinky-slurm-operator`, `slinky-slurm`) are declared inline per slurm leaf overlay — the same shape `dynamo-platform` uses across `*-inference-dynamo` leaves. Leaves that want the operator only inline the CRDs + operator and omit the `slinky-slurm` componentRef. For an end-to-end walkthrough (recipe → bundle → install → validate → `srun` smoke job on EKS, GKE, or Kind), see [`demos/cuj1-slinky-slurm.md`](https://github.com/NVIDIA/aicr/blob/main/demos/cuj1-slinky-slurm.md).
 - **Accelerator/OS-specific tuning** (nodewright-customizations, nvidia-dra-driver-gpu) varies by hardware and OS combination.
 
 ### NFD Topology Updater

@@ -128,6 +128,7 @@ require (
 	github.com/google/s2a-go v0.1.9 // indirect
 	github.com/googleapis/enterprise-certificate-proxy v0.3.16 // indirect
 	github.com/googleapis/gax-go/v2 v2.22.0 // indirect
+	github.com/gorilla/websocket v1.5.4-0.20250319132907-e064f32e3674 // indirect
 	github.com/grpc-ecosystem/grpc-gateway/v2 v2.29.0 // indirect
 	github.com/hashicorp/go-cleanhttp v0.5.2 // indirect
 	github.com/hashicorp/go-retryablehttp v0.7.8 // indirect
@@ -145,6 +146,7 @@ require (
 	github.com/mattn/go-isatty v0.0.22 // indirect
 	github.com/mitchellh/copystructure v1.2.0 // indirect
 	github.com/mitchellh/reflectwalk v1.0.2 // indirect
+	github.com/moby/spdystream v0.5.1 // indirect
 	github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd // indirect
 	github.com/modern-go/reflect2 v1.0.3-0.20250322232337-35a7c28c31ee // indirect
 	github.com/monochromegane/go-gitignore v0.0.0-20200626010858-205db1a8cc00 // indirect
@@ -202,6 +204,7 @@ require (
 	k8s.io/klog/v2 v2.140.0 // indirect
 	k8s.io/kube-openapi v0.0.0-20260603220949-865597e52e25 // indirect
 	k8s.io/kubernetes v1.36.2 // indirect
+	k8s.io/streaming v0.36.2 // indirect
 	sigs.k8s.io/json v0.0.0-20250730193827-2d320260d730 // indirect
 	sigs.k8s.io/randfill v1.0.0 // indirect
 	sigs.k8s.io/structured-merge-diff/v6 v6.4.0 // indirect