feat(cluster): add cross-platform RDMA recommender and SLURM GCM workflow #114
Add a native RDMA recommendation path across SLURM and Kubernetes plus phased SLURM-only GCM preflight/collector controls behind additional_context.cluster so cluster features are opt-in, reproducible, and safer to operate. Include smoke configs, wrapper tooling, and docs to make validation and rollout straightforward.
Add minimal smoke configuration files for SLURM and Kubernetes so RDMA and phase-1 SLURM GCM behaviors can be validated quickly with consistent, reproducible settings.
There was a problem hiding this comment.
Pull request overview
Adds an opt-in additional_context.cluster feature layer to support phased cluster integration across SLURM and Kubernetes, including RDMA environment recommendations and (phase-1) SLURM-only GCM health/collector integrations, plus smoke assets and documentation to validate the new flow.
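For orientation, here is a hypothetical sketch of what an opt-in config might look like. The field names (`rdma.mode`, `apply_env`, `strict`, `gcm.health_checks`, `gcm.collector`) are inferred from this review's descriptions; the authoritative schema lives in `src/madengine/cli/validators.py` and may differ:

```json
{
  "additional_context": {
    "cluster": {
      "rdma": {
        "enabled": true,
        "mode": "recommend",
        "apply_env": true,
        "strict": false
      },
      "gcm": {
        "enabled": true,
        "strict": false,
        "health_checks": ["check-hca", "check-ibstat"],
        "collector": {
          "command": "slurm_job_monitor",
          "timeout_sec": 120,
          "max_retries": 1
        }
      }
    }
  }
}
```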
Changes:
- Plumbs `cluster.*` through validation, manifests, and build/run orchestration so deployments can consume it.
- Adds RDMA recommender execution plus per-node/per-pod artifact capture for SLURM and Kubernetes.
- Adds SLURM-only GCM preflight health checks and a best-effort collector snapshot during result collection, plus smoke configs/docs.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/madengine/orchestration/run_orchestrator.py | Merges cluster into runtime additional_context and persists it into synthetic manifests for local-image mode. |
| src/madengine/orchestration/build_orchestrator.py | Persists cluster into deployment_config in the build manifest. |
| src/madengine/deployment/templates/slurm/job.sh.j2 | Runs RDMA recommender in SLURM job/task flows; copies RDMA artifacts into per-node collection directories. |
| src/madengine/deployment/templates/kubernetes/job.yaml.j2 | Runs RDMA recommender inside pods and copies RDMA artifacts into results. |
| src/madengine/deployment/slurm.py | Introduces cluster parsing, RDMA/GCM wiring, GCM preflight + collector snapshot, and attaches cluster feature summaries to results. |
| src/madengine/deployment/rdma_recommender.py | New native RDMA environment recommender producing a JSON artifact + optional env file. |
| src/madengine/deployment/kubernetes.py | Adds cluster.rdma handling: embeds recommender script into ConfigMap, passes RDMA template flags, and summarizes RDMA artifacts in results. |
| src/madengine/cli/validators.py | Adds schema validation for additional_context.cluster (RDMA + GCM). |
| examples/slurm-configs/configs/smoke-rdma-gcm-slurm.json | New SLURM smoke config enabling RDMA + phase-1 GCM. |
| examples/slurm-configs/README.md | Documents the new cluster smoke config and runner usage for SLURM. |
| examples/k8s-configs/configs/smoke-rdma-k8s.json | New Kubernetes smoke config enabling RDMA. |
| examples/k8s-configs/README.md | Documents the new cluster smoke config and runner usage for Kubernetes. |
| examples/run-smoke.sh | Adds a wrapper script to run/verify SLURM and Kubernetes smoke targets via Makefile.smoke. |
| examples/cluster-smoke-checklist.md | Adds a step-by-step checklist for validating RDMA+GCM smoke artifacts. |
| examples/Makefile.smoke | Adds make targets to run and verify smoke workflows for SLURM and Kubernetes. |
| docs/deployment.md | Documents cluster feature stage placement and points to smoke assets. |
| docs/configuration.md | Documents the additional_context.cluster schema and behavior (RDMA + SLURM-only GCM). |
| README.md | Adds a “Smoke Testing” section linking to the new smoke assets and workflows. |
Comments suppressed due to low confidence (1)
src/madengine/cli/validators.py:155
`deploy` is treated as a nested object here, but the rest of the codebase (and the examples) use `deploy` as a string target (e.g. "slurm"/"k8s"). This structure check will reject valid configs; adjust validation so `deploy` is validated as a string (or remove it from the nested-object list and validate it separately).
```python
    "deployment_config",
    "deploy",
):
    if nest in context and not isinstance(context[nest], dict):
        _fail_structure(nest, "an object")
```
```python
DEPLOYMENT_TYPE = "slurm"
REQUIRED_TOOLS = ["sbatch", "squeue", "scontrol"]  # Must be available locally
GCM_ALLOWED_HEALTH_CHECKS = {"check-hca", "check-ibstat"}
GCM_ALLOWED_COLLECTORS = {"slurm_job_monitor"}
```
`REQUIRED_TOOLS` is used by `validate()` to check local SLURM tooling, but `validate()` also calls `sinfo`, and the new GCM preflight uses `srun`. Consider adding `sinfo` (and `srun` when `cluster.gcm.enabled` is true) to the required-tool checks so failures are surfaced consistently and earlier.
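One way to surface these failures earlier is to derive the required-tool list from the enabled features before validation runs. A minimal sketch — the `REQUIRED_TOOLS` base list comes from this PR, while the helper name and the `gcm_enabled` flag are hypothetical:

```python
# Base tools checked today; validate() also calls sinfo, and the GCM
# preflight shells out to srun, so both can be gated in the same place.
REQUIRED_TOOLS = ["sbatch", "squeue", "scontrol"]

def required_tools(gcm_enabled: bool) -> list:
    """Hypothetical helper: full local-tool list for a given feature set."""
    tools = list(REQUIRED_TOOLS) + ["sinfo"]  # sinfo is used unconditionally
    if gcm_enabled:
        tools.append("srun")  # GCM preflight health checks run via srun
    return tools
```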
```python
collector_cmd = collector_cfg.get("command", "slurm_job_monitor")
timeout_sec = int(collector_cfg.get("timeout_sec", 120))
max_retries = int(collector_cfg.get("max_retries", 1))
```
`max_retries` can be configured to 0 or a negative value, which makes the `while attempts < max_retries` loop skip entirely; in that case the function returns `failed_best_effort` without ever writing the collector output file. Clamp `max_retries` to at least 1 (or explicitly handle 0 by writing a skipped/error artifact).
Suggested change:

```diff
-max_retries = int(collector_cfg.get("max_retries", 1))
+max_retries = max(1, int(collector_cfg.get("max_retries", 1)))
```
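The suggested clamp guarantees the retry loop executes at least once regardless of user input; a quick self-contained illustration (the config keys come from the PR, the helper name is hypothetical):

```python
def effective_retries(collector_cfg: dict) -> int:
    # Clamp so the collector loop always runs at least once,
    # even when the config supplies 0 or a negative value.
    return max(1, int(collector_cfg.get("max_retries", 1)))
```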
```bash
# Native RDMA recommender (submission node / single-node path)
echo ""
echo "Running RDMA recommender..."
RDMA_ARTIFACT="${WORKSPACE}/{{ rdma_artifact_name }}"
RDMA_ENV_FILE="${WORKSPACE}/rdma_env.list"
if [ "{{ rdma_strict }}" = "True" ]; then
    python3 -m madengine.deployment.rdma_recommender \
        --output "${RDMA_ARTIFACT}" \
        --env-file "${RDMA_ENV_FILE}" \
        --strict
else
    python3 -m madengine.deployment.rdma_recommender \
        --output "${RDMA_ARTIFACT}" \
        --env-file "${RDMA_ENV_FILE}"
fi
RDMA_EXIT=$?
if [ ${RDMA_EXIT} -ne 0 ]; then
    echo "RDMA recommender failed with exit code ${RDMA_EXIT}"
    {% if rdma_strict %}
    echo "RDMA strict mode enabled; stopping job."
    exit ${RDMA_EXIT}
    {% else %}
    echo "Continuing (strict=false)."
    {% endif %}
fi
```
For single-node runs (`nodes == 1`), the RDMA recommender is invoked here and then again later in the single-node execution path (another `{% if rdma_enabled %}` block). This duplicates work and can overwrite artifacts/env decisions; keep a single invocation for the single-node flow (and update the comment about "submission node" accordingly).
Suggested change:

```bash
# Single-node RDMA recommender is invoked later in the execution path.
# Do not run it here to avoid duplicate work and overwriting RDMA artifacts/env.
{% endif %}
```
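If the submission-node pass should still run for multi-node jobs, another option is to guard it on the node count instead of removing it. A hypothetical sketch of that guard (the function and echoed strings are stand-ins for the template's `python3 -m madengine.deployment.rdma_recommender` invocation):

```shell
# Hypothetical guard: only run the "submission node" RDMA pass for
# multi-node jobs; single-node jobs defer to the later single-node block.
maybe_run_rdma() {
  nnodes="$1"
  if [ "${nnodes}" -gt 1 ]; then
    echo "rdma-recommender"   # stand-in for the real recommender invocation
  else
    echo "deferred"           # single-node path runs the recommender later
  fi
}
```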
## 0) Set shared variables

```bash
cd /home/ysha/amd/madengine
```
This checklist hard-codes a developer-specific path (`/home/ysha/amd/madengine`). Use a generic instruction (e.g., `cd <repo-root>`) so the doc is usable by others and in CI environments.
Suggested change:

```diff
-cd /home/ysha/amd/madengine
+cd <repo-root>
```
```python
if "cluster" in context:
    cluster = context["cluster"]
    if not isinstance(cluster, dict):
        _fail_structure("cluster", "an object")
```
The new `cluster.*` validation logic doesn't appear to have any unit test coverage (no references in `tests/`), so regressions in schema/allowlist behavior may go unnoticed. Add validator tests covering both accepted and rejected `cluster.rdma`/`cluster.gcm` configurations (including allowlist enforcement and the mode enum).
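A self-contained sketch of the kind of checks such tests would pin down. `GCM_ALLOWED_HEALTH_CHECKS` and the mode enum come from this PR; the validator function below is a simplified stand-in for the real one in `src/madengine/cli/validators.py`, and `RDMA_MODES` is an assumed enum:

```python
GCM_ALLOWED_HEALTH_CHECKS = {"check-hca", "check-ibstat"}
RDMA_MODES = {"recommend", "enforce"}  # assumed mode enum

def validate_cluster(cluster: dict) -> list:
    """Stand-in validator: return a list of errors (empty means accepted)."""
    errors = []
    rdma = cluster.get("rdma", {})
    if "mode" in rdma and rdma["mode"] not in RDMA_MODES:
        errors.append(f"cluster.rdma.mode: invalid value {rdma['mode']!r}")
    gcm = cluster.get("gcm", {})
    for check in gcm.get("health_checks", []):
        # Allowlist enforcement: reject anything outside the known checks.
        if check not in GCM_ALLOWED_HEALTH_CHECKS:
            errors.append(f"cluster.gcm.health_checks: {check!r} not allowed")
    return errors
```

Unit tests would then assert both the accepted and rejected sides of each rule, which is what keeps allowlist regressions visible.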
```python
if int_key in collector and not isinstance(collector[int_key], int):
    _fail_structure(
        f"cluster.gcm.collector.{int_key}", "an integer"
    )
```
`cluster.gcm.collector.timeout_sec` and `max_retries` are only type-checked. Values like `max_retries=0` or negative numbers will lead to unexpected runtime behavior (e.g., the collector loop never runs). Validate sensible minimums (e.g., `timeout_sec >= 1`, `max_retries >= 1`).
Suggested addition:

```python
if "timeout_sec" in collector and collector["timeout_sec"] < 1:
    _fail_value(
        "cluster.gcm.collector.timeout_sec", "must be >= 1"
    )
if "max_retries" in collector and collector["max_retries"] < 1:
    _fail_value(
        "cluster.gcm.collector.max_retries", "must be >= 1"
    )
```
Summary
This PR introduces a new cluster feature layer in `additional_context.cluster` to support phased cluster integration:
- RDMA environment recommendations (`mode`, `apply_env`, `strict`) and per-node/per-pod artifacts.
- SLURM-only GCM preflight health checks (`check-hca`, `check-ibstat`) with strict/warn gating.
- Best-effort collector snapshot (`slurm_job_monitor --once`) during result collection.
- GCM pinned to https://github.com/coketaste/gcm at fixed ref 9fed02cd0721d3937f8749672951185f31955bd4.
- `cluster.*` is validated, persisted, and merged across build/run orchestration.
- Smoke assets: `examples/Makefile.smoke` targets and the `examples/run-smoke.sh` wrapper.
Why
This provides a safe, opt-in path to improve multi-node readiness and observability without changing default behavior. It enables incremental rollout with reproducible dependencies and explicit guardrails.
Scope / Guardrails
Test Plan
- Unit tests for `additional_context.cluster` validation (including allowlist enforcement).
- Run the SLURM smoke config `examples/slurm-configs/configs/smoke-rdma-gcm-slurm.json` and confirm:
- Run the Kubernetes smoke config `examples/k8s-configs/configs/smoke-rdma-k8s.json` and confirm RDMA artifact generation.
- `cluster.rdma.strict=true` gates when the recommendation fails.
- `cluster.gcm.strict=true` gates on failed health checks (SLURM only).
- `mode=recommend` preserves explicit user env overrides.
- `mode=enforce` applies recommender values over conflicting keys.