
feat(cluster): add cross-platform RDMA recommender and SLURM GCM workflow#114

Open
coketaste wants to merge 2 commits into develop from coketaste/v2-rdma-gcm

Conversation

@coketaste
Collaborator

Summary

This PR introduces a new cluster feature layer in additional_context.cluster to support phased cluster integration:

  • Adds native RDMA environment recommendation support for both SLURM and Kubernetes, including runtime env application controls (mode, apply_env, strict) and per-node/per-pod artifacts.
  • Adds SLURM-only GCM integration for phase 1:
    • targeted preflight health checks (check-hca, check-ibstat) with strict/warn gating
    • one-shot lightweight collector (slurm_job_monitor --once) in result collection
    • guardrails for allowlisted commands, timeout/retry behavior, and best-effort semantics
    • source pinning/provenance checks for https://github.com/coketaste/gcm at fixed ref 9fed02cd0721d3937f8749672951185f31955bd4.
  • Extends config/manifest plumbing so cluster.* is validated, persisted, and merged across build/run orchestration.
  • Adds smoke assets and discoverability updates:
    • smoke configs for SLURM/K8s
    • examples/Makefile.smoke
    • examples/run-smoke.sh wrapper
    • smoke checklist docs and cross-links in README/deployment/configuration/example docs.

Why

This provides a safe, opt-in path to improve multi-node readiness and observability without changing default behavior. It enables incremental rollout with reproducible dependencies and explicit guardrails.

Scope / Guardrails

  • RDMA recommender: enabled for SLURM + Kubernetes.
  • GCM integration: SLURM-only in phase 1.
  • Feature flags default off (no behavior change unless enabled).
  • Collector remains best-effort by default.
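As a sketch of how these opt-in flags might be combined in a model config (field names are inferred from this PR's description; the authoritative schema lives in src/madengine/cli/validators.py and the smoke configs):

```json
{
  "additional_context": {
    "cluster": {
      "rdma": { "enabled": true, "mode": "recommend", "apply_env": true, "strict": false },
      "gcm": {
        "enabled": true,
        "strict": false,
        "health_checks": ["check-hca", "check-ibstat"],
        "collector": { "command": "slurm_job_monitor", "timeout_sec": 120, "max_retries": 1 }
      }
    }
  }
}
```

Omitting the cluster block entirely leaves default behavior unchanged, since all feature flags default off.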

Test Plan

  • Validate config schema acceptance for additional_context.cluster (including allowlist enforcement).
  • SLURM smoke: run examples/slurm-configs/configs/smoke-rdma-gcm-slurm.json and confirm:
    • RDMA artifact generation
    • GCM health summary/raw logs
    • collector artifact output
  • Kubernetes smoke: run examples/k8s-configs/configs/smoke-rdma-k8s.json and confirm RDMA artifact generation.
  • Verify strict-mode behavior:
    • cluster.rdma.strict=true gates when recommendation fails
    • cluster.gcm.strict=true gates on failed health checks (SLURM only)
  • Verify env precedence semantics:
    • mode=recommend preserves explicit user env overrides
    • mode=enforce applies recommender values over conflicting keys.
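The env precedence semantics above can be sketched as a small merge function (a hypothetical helper for illustration, not the PR's actual implementation):

```python
def merge_env(user_env: dict, recommended: dict, mode: str) -> dict:
    """Merge RDMA-recommended env vars with user-provided env.

    mode="recommend": explicit user keys win over recommender values.
    mode="enforce":   recommender values win over conflicting user keys.
    """
    if mode == "recommend":
        return {**recommended, **user_env}
    if mode == "enforce":
        return {**user_env, **recommended}
    raise ValueError(f"unknown mode: {mode!r}")

user = {"NCCL_IB_HCA": "mlx5_0"}
rec = {"NCCL_IB_HCA": "mlx5_0,mlx5_1", "NCCL_IB_GID_INDEX": "3"}

merge_env(user, rec, "recommend")  # keeps the user's NCCL_IB_HCA
merge_env(user, rec, "enforce")    # recommender value overrides it
```

Either way, keys only one side provides are passed through unchanged; the mode only decides conflicts.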

Add a native RDMA recommendation path across SLURM and Kubernetes plus phased SLURM-only GCM preflight/collector controls behind additional_context.cluster so cluster features are opt-in, reproducible, and safer to operate. Include smoke configs, wrapper tooling, and docs to make validation and rollout straightforward.
Add minimal smoke configuration files for SLURM and Kubernetes so RDMA and phase-1 SLURM GCM behaviors can be validated quickly with consistent, reproducible settings.
@coketaste coketaste self-assigned this Apr 26, 2026
Copilot AI review requested due to automatic review settings April 26, 2026 03:36
@coketaste coketaste changed the title feat(cluster): add cross-platform RDMA recommender and SLURM GCM smoke workflow feat(cluster): add cross-platform RDMA recommender and SLURM GCM workflow Apr 26, 2026

Copilot AI left a comment


Pull request overview

Adds an opt-in additional_context.cluster feature layer to support phased cluster integration across SLURM and Kubernetes, including RDMA environment recommendations and (phase-1) SLURM-only GCM health/collector integrations, plus smoke assets and documentation to validate the new flow.

Changes:

  • Plumbs cluster.* through validation, manifests, and build/run orchestration so deployments can consume it.
  • Adds RDMA recommender execution + per-node/per-pod artifact capture for SLURM and Kubernetes.
  • Adds SLURM-only GCM preflight health checks and a best-effort collector snapshot during result collection, plus smoke configs/docs.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.

Summary per file:

| File | Description |
| --- | --- |
| src/madengine/orchestration/run_orchestrator.py | Merges cluster into runtime additional_context and persists it into synthetic manifests for local-image mode. |
| src/madengine/orchestration/build_orchestrator.py | Persists cluster into deployment_config in the build manifest. |
| src/madengine/deployment/templates/slurm/job.sh.j2 | Runs RDMA recommender in SLURM job/task flows; copies RDMA artifacts into per-node collection directories. |
| src/madengine/deployment/templates/kubernetes/job.yaml.j2 | Runs RDMA recommender inside pods and copies RDMA artifacts into results. |
| src/madengine/deployment/slurm.py | Introduces cluster parsing, RDMA/GCM wiring, GCM preflight + collector snapshot, and attaches cluster feature summaries to results. |
| src/madengine/deployment/rdma_recommender.py | New native RDMA environment recommender producing a JSON artifact + optional env file. |
| src/madengine/deployment/kubernetes.py | Adds cluster.rdma handling: embeds recommender script into ConfigMap, passes RDMA template flags, and summarizes RDMA artifacts in results. |
| src/madengine/cli/validators.py | Adds schema validation for additional_context.cluster (RDMA + GCM). |
| examples/slurm-configs/configs/smoke-rdma-gcm-slurm.json | New SLURM smoke config enabling RDMA + phase-1 GCM. |
| examples/slurm-configs/README.md | Documents the new cluster smoke config and runner usage for SLURM. |
| examples/k8s-configs/configs/smoke-rdma-k8s.json | New Kubernetes smoke config enabling RDMA. |
| examples/k8s-configs/README.md | Documents the new cluster smoke config and runner usage for Kubernetes. |
| examples/run-smoke.sh | Adds a wrapper script to run/verify SLURM and Kubernetes smoke targets via Makefile.smoke. |
| examples/cluster-smoke-checklist.md | Adds a step-by-step checklist for validating RDMA+GCM smoke artifacts. |
| examples/Makefile.smoke | Adds make targets to run and verify smoke workflows for SLURM and Kubernetes. |
| docs/deployment.md | Documents cluster feature stage placement and points to smoke assets. |
| docs/configuration.md | Documents the additional_context.cluster schema and behavior (RDMA + SLURM-only GCM). |
| README.md | Adds a “Smoke Testing” section linking to the new smoke assets and workflows. |
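For intuition about the recommender pattern listed above (probe the node, write a JSON artifact plus an env file), here is a rough sketch; the function name, probing logic, and artifact fields are illustrative assumptions, not the real rdma_recommender API:

```python
import json
from pathlib import Path


def recommend_rdma_env(output: Path, env_file: Path) -> dict:
    """Probe for RDMA devices (stubbed) and persist recommendations.

    A real implementation would inspect /sys/class/infiniband, ibstat
    output, NIC topology, etc., and choose transport-specific env vars.
    """
    ib = Path("/sys/class/infiniband")
    devices = sorted(p.name for p in ib.glob("*")) if ib.exists() else []
    if devices:
        env = {"NCCL_IB_HCA": ",".join(devices)}
    else:
        env = {"NCCL_IB_DISABLE": "1"}  # no RDMA HCAs found; fall back
    artifact = {"devices": devices, "recommended_env": env}
    output.write_text(json.dumps(artifact, indent=2))          # JSON artifact
    env_file.write_text("".join(f"{k}={v}\n" for k, v in env.items()))  # env file
    return artifact
```

The env file uses simple KEY=VALUE lines so a job script can source or parse it before launching the workload.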
Comments suppressed due to low confidence (1)

src/madengine/cli/validators.py:155

  • deploy is treated as a nested object here, but the rest of the codebase (and examples) use deploy as a string target (e.g. "slurm"/"k8s"). This structure check will reject valid configs; adjust validation so deploy is validated as a string (or remove it from the nested-object list and validate separately).
```python
        "deployment_config",
        "deploy",
    ):
        if nest in context and not isinstance(context[nest], dict):
            _fail_structure(nest, "an object")
```
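One way the suggested fix could look (a sketch only; the minimal _fail_structure and validate_context wrapper here are stand-ins so the example is self-contained, not the validator's real shape):

```python
def _fail_structure(key: str, expected: str):
    # Stand-in for the validator's real error helper.
    raise ValueError(f"additional_context.{key} must be {expected}")


def validate_context(context: dict) -> None:
    # Nested configuration sections must be objects...
    for nest in ("deployment_config",):
        if nest in context and not isinstance(context[nest], dict):
            _fail_structure(nest, "an object")
    # ...while deploy is a plain string target such as "slurm" or "k8s".
    if "deploy" in context and not isinstance(context["deploy"], str):
        _fail_structure("deploy", "a string")
```

With this split, configs like `{"deploy": "slurm"}` pass instead of being rejected by the nested-object check.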


Comment on lines 50 to +53
```python
DEPLOYMENT_TYPE = "slurm"
REQUIRED_TOOLS = ["sbatch", "squeue", "scontrol"]  # Must be available locally
GCM_ALLOWED_HEALTH_CHECKS = {"check-hca", "check-ibstat"}
GCM_ALLOWED_COLLECTORS = {"slurm_job_monitor"}
```

Copilot AI Apr 26, 2026


REQUIRED_TOOLS is used by validate() to check local SLURM tooling, but validate() also calls sinfo, and the new GCM preflight uses srun. Consider adding sinfo (and srun when cluster.gcm.enabled is true) to the required tool checks so failures are surfaced consistently and earlier.
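A sketch of how the tool list could be made conditional (constant names follow the snippet above; the helper functions are illustrative):

```python
import shutil

BASE_TOOLS = ["sbatch", "squeue", "scontrol", "sinfo"]  # sinfo: used by validate()


def required_tools(gcm_enabled: bool) -> list:
    """Return the local SLURM binaries validate() should check for."""
    # srun is only needed when the GCM preflight will run health checks.
    return BASE_TOOLS + (["srun"] if gcm_enabled else [])


def missing_tools(gcm_enabled: bool) -> list:
    """Which required binaries are absent from PATH on this machine."""
    return [t for t in required_tools(gcm_enabled) if shutil.which(t) is None]
```

Surfacing missing binaries in one upfront check keeps failures consistent instead of erroring later mid-run.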


```python
collector_cmd = collector_cfg.get("command", "slurm_job_monitor")
timeout_sec = int(collector_cfg.get("timeout_sec", 120))
max_retries = int(collector_cfg.get("max_retries", 1))
```

Copilot AI Apr 26, 2026


max_retries can be configured to 0/negative, which makes the while attempts < max_retries loop skip entirely; in that case the function returns failed_best_effort without ever writing the collector output file. Clamp max_retries to at least 1 (or explicitly handle 0 by writing a skipped/error artifact).

Suggested change

```diff
-max_retries = int(collector_cfg.get("max_retries", 1))
+max_retries = max(1, int(collector_cfg.get("max_retries", 1)))
```
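A minimal illustration of why the clamp matters (run_collector is a hypothetical stand-in whose retry loop mirrors the described `while attempts < max_retries` shape):

```python
def run_collector(cmd, max_retries: int):
    """Attempt cmd() up to max_retries times, returning its result or a sentinel."""
    max_retries = max(1, int(max_retries))  # 0/negative would otherwise skip the loop
    attempts = 0
    while attempts < max_retries:
        attempts += 1
        try:
            return cmd()
        except RuntimeError:
            continue  # best-effort: retry until attempts are exhausted
    return "failed_best_effort"


run_collector(lambda: "snapshot", max_retries=0)  # still runs once after clamping
```

Without the clamp, `max_retries=0` returns the sentinel without ever invoking the collector or writing its output file.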

Comment on lines +309 to +333
```bash
# Native RDMA recommender (submission node / single-node path)
echo ""
echo "Running RDMA recommender..."
RDMA_ARTIFACT="${WORKSPACE}/{{ rdma_artifact_name }}"
RDMA_ENV_FILE="${WORKSPACE}/rdma_env.list"
if [ "{{ rdma_strict }}" = "True" ]; then
    python3 -m madengine.deployment.rdma_recommender \
        --output "${RDMA_ARTIFACT}" \
        --env-file "${RDMA_ENV_FILE}" \
        --strict
else
    python3 -m madengine.deployment.rdma_recommender \
        --output "${RDMA_ARTIFACT}" \
        --env-file "${RDMA_ENV_FILE}"
fi
RDMA_EXIT=$?
if [ ${RDMA_EXIT} -ne 0 ]; then
    echo "RDMA recommender failed with exit code ${RDMA_EXIT}"
    {% if rdma_strict %}
    echo "RDMA strict mode enabled; stopping job."
    exit ${RDMA_EXIT}
    {% else %}
    echo "Continuing (strict=false)."
    {% endif %}
fi
```

Copilot AI Apr 26, 2026


For single-node runs (nodes == 1), the RDMA recommender is invoked here and then again later in the single-node execution path (another {% if rdma_enabled %} block). This duplicates work and can overwrite artifacts/env decisions; keep a single invocation for the single-node flow (and update the comment about “submission node” accordingly).

Suggested change

```diff
-# Native RDMA recommender (submission node / single-node path)
-echo ""
-echo "Running RDMA recommender..."
-RDMA_ARTIFACT="${WORKSPACE}/{{ rdma_artifact_name }}"
-RDMA_ENV_FILE="${WORKSPACE}/rdma_env.list"
-if [ "{{ rdma_strict }}" = "True" ]; then
-python3 -m madengine.deployment.rdma_recommender \
-    --output "${RDMA_ARTIFACT}" \
-    --env-file "${RDMA_ENV_FILE}" \
-    --strict
-else
-python3 -m madengine.deployment.rdma_recommender \
-    --output "${RDMA_ARTIFACT}" \
-    --env-file "${RDMA_ENV_FILE}"
-fi
-RDMA_EXIT=$?
-if [ ${RDMA_EXIT} -ne 0 ]; then
-echo "RDMA recommender failed with exit code ${RDMA_EXIT}"
-{% if rdma_strict %}
-echo "RDMA strict mode enabled; stopping job."
-exit ${RDMA_EXIT}
-{% else %}
-echo "Continuing (strict=false)."
-{% endif %}
-fi
+# Single-node RDMA recommender is invoked later in the execution path.
+# Do not run it here to avoid duplicate work and overwriting RDMA artifacts/env.
+{% endif %}
```

## 0) Set shared variables

```bash
cd /home/ysha/amd/madengine
```

Copilot AI Apr 26, 2026


This checklist hard-codes a developer-specific path (/home/ysha/amd/madengine). Use a generic instruction (e.g., cd <repo-root>) so the doc is usable by others and in CI environments.

Suggested change

```diff
-cd /home/ysha/amd/madengine
+cd <repo-root>
```

Comment on lines +207 to +211
```python
    if "cluster" in context:
        cluster = context["cluster"]
        if not isinstance(cluster, dict):
            _fail_structure("cluster", "an object")
```


Copilot AI Apr 26, 2026


The new cluster.* validation logic doesn’t appear to have any unit test coverage (no references in tests/), so regressions in schema/allowlist behavior may go unnoticed. Add validator tests covering both accepted and rejected cluster.rdma/cluster.gcm configurations (including allowlist enforcement and mode enum).

```python
            if int_key in collector and not isinstance(collector[int_key], int):
                _fail_structure(
                    f"cluster.gcm.collector.{int_key}", "an integer"
                )
```

Copilot AI Apr 26, 2026


cluster.gcm.collector.timeout_sec and max_retries are only type-checked. Values like max_retries=0 or negative numbers will lead to unexpected runtime behavior (e.g., collector loop never runs). Validate sensible minimums (e.g., timeout_sec >= 1, max_retries >= 1).

Suggested change

```diff
                 )
+        if "timeout_sec" in collector and collector["timeout_sec"] < 1:
+            _fail_value(
+                "cluster.gcm.collector.timeout_sec", "must be >= 1"
+            )
+        if "max_retries" in collector and collector["max_retries"] < 1:
+            _fail_value(
+                "cluster.gcm.collector.max_retries", "must be >= 1"
+            )
```
