
feat(cluster): add cross-platform RDMA recommender and SLURM GCM workflow#114

Open
coketaste wants to merge 2 commits into develop from coketaste/v2-rdma-gcm

Conversation

@coketaste
Collaborator

Summary

This PR introduces a new cluster feature layer in additional_context.cluster to support phased cluster integration:

  • Adds native RDMA environment recommendation support for both SLURM and Kubernetes, including runtime env application controls (mode, apply_env, strict) and per-node/per-pod artifacts.
  • Adds SLURM-only GCM integration for phase 1:
    • targeted preflight health checks (check-hca, check-ibstat) with strict/warn gating
    • one-shot lightweight collector (slurm_job_monitor --once) in result collection
    • guardrails for allowlisted commands, timeout/retry behavior, and best-effort semantics
    • source pinning/provenance checks for https://github.com/coketaste/gcm at fixed ref 9fed02cd0721d3937f8749672951185f31955bd4.
  • Extends config/manifest plumbing so cluster.* is validated, persisted, and merged across build/run orchestration.
  • Adds smoke assets and discoverability updates:
    • smoke configs for SLURM/K8s
    • examples/Makefile.smoke
    • examples/run-smoke.sh wrapper
    • smoke checklist docs and cross-links in README/deployment/configuration/example docs.

Why

This provides a safe, opt-in path to improve multi-node readiness and observability without changing default behavior. It enables incremental rollout with reproducible dependencies and explicit guardrails.

Scope / Guardrails

  • RDMA recommender: enabled for SLURM + Kubernetes.
  • GCM integration: SLURM-only in phase 1.
  • Feature flags default off (no behavior change unless enabled).
  • Collector remains best-effort by default.
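As a sketch of how these opt-in flags might be combined in a model config (field names are inferred from this PR's description; the authoritative schema lives in src/madengine/cli/validators.py and the smoke configs):

```json
{
  "additional_context": {
    "cluster": {
      "rdma": { "enabled": true, "mode": "recommend", "apply_env": true, "strict": false },
      "gcm": {
        "enabled": true,
        "strict": false,
        "health_checks": ["check-hca", "check-ibstat"],
        "collector": { "command": "slurm_job_monitor", "timeout_sec": 120, "max_retries": 1 }
      }
    }
  }
}
```

Omitting the cluster block entirely leaves default behavior unchanged, since all feature flags default off.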

Test Plan

  • Validate config schema acceptance for additional_context.cluster (including allowlist enforcement).
  • SLURM smoke: run examples/slurm-configs/configs/smoke-rdma-gcm-slurm.json and confirm:
    • RDMA artifact generation
    • GCM health summary/raw logs
    • collector artifact output
  • Kubernetes smoke: run examples/k8s-configs/configs/smoke-rdma-k8s.json and confirm RDMA artifact generation.
  • Verify strict-mode behavior:
    • cluster.rdma.strict=true gates when recommendation fails
    • cluster.gcm.strict=true gates on failed health checks (SLURM only)
  • Verify env precedence semantics:
    • mode=recommend preserves explicit user env overrides
    • mode=enforce applies recommender values over conflicting keys.
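The env precedence semantics above can be sketched as a small merge function (a hypothetical helper for illustration, not the PR's actual implementation):

```python
def merge_env(user_env: dict, recommended: dict, mode: str) -> dict:
    """Merge RDMA-recommended env vars with user-provided env.

    mode="recommend": explicit user keys win over recommender values.
    mode="enforce":   recommender values win over conflicting user keys.
    """
    if mode == "recommend":
        return {**recommended, **user_env}
    if mode == "enforce":
        return {**user_env, **recommended}
    raise ValueError(f"unknown mode: {mode!r}")

user = {"NCCL_IB_HCA": "mlx5_0"}
rec = {"NCCL_IB_HCA": "mlx5_0,mlx5_1", "NCCL_IB_GID_INDEX": "3"}

merge_env(user, rec, "recommend")  # keeps the user's NCCL_IB_HCA
merge_env(user, rec, "enforce")    # recommender value overrides it
```

Either way, keys only one side provides are passed through unchanged; the mode only decides conflicts.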

Add a native RDMA recommendation path across SLURM and Kubernetes plus phased SLURM-only GCM preflight/collector controls behind additional_context.cluster so cluster features are opt-in, reproducible, and safer to operate. Include smoke configs, wrapper tooling, and docs to make validation and rollout straightforward.
Add minimal smoke configuration files for SLURM and Kubernetes so RDMA and phase-1 SLURM GCM behaviors can be validated quickly with consistent, reproducible settings.
@coketaste coketaste self-assigned this Apr 26, 2026
Copilot AI review requested due to automatic review settings April 26, 2026 03:36
@coketaste coketaste changed the title feat(cluster): add cross-platform RDMA recommender and SLURM GCM smoke workflow feat(cluster): add cross-platform RDMA recommender and SLURM GCM workflow Apr 26, 2026

Copilot AI left a comment


Pull request overview

Adds an opt-in additional_context.cluster feature layer to support phased cluster integration across SLURM and Kubernetes, including RDMA environment recommendations and (phase-1) SLURM-only GCM health/collector integrations, plus smoke assets and documentation to validate the new flow.

Changes:

  • Plumbs cluster.* through validation, manifests, and build/run orchestration so deployments can consume it.
  • Adds RDMA recommender execution + per-node/per-pod artifact capture for SLURM and Kubernetes.
  • Adds SLURM-only GCM preflight health checks and a best-effort collector snapshot during result collection, plus smoke configs/docs.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.

Summary per file:

| File | Description |
| --- | --- |
| src/madengine/orchestration/run_orchestrator.py | Merges cluster into runtime additional_context and persists it into synthetic manifests for local-image mode. |
| src/madengine/orchestration/build_orchestrator.py | Persists cluster into deployment_config in the build manifest. |
| src/madengine/deployment/templates/slurm/job.sh.j2 | Runs RDMA recommender in SLURM job/task flows; copies RDMA artifacts into per-node collection directories. |
| src/madengine/deployment/templates/kubernetes/job.yaml.j2 | Runs RDMA recommender inside pods and copies RDMA artifacts into results. |
| src/madengine/deployment/slurm.py | Introduces cluster parsing, RDMA/GCM wiring, GCM preflight + collector snapshot, and attaches cluster feature summaries to results. |
| src/madengine/deployment/rdma_recommender.py | New native RDMA environment recommender producing a JSON artifact + optional env file. |
| src/madengine/deployment/kubernetes.py | Adds cluster.rdma handling: embeds recommender script into ConfigMap, passes RDMA template flags, and summarizes RDMA artifacts in results. |
| src/madengine/cli/validators.py | Adds schema validation for additional_context.cluster (RDMA + GCM). |
| examples/slurm-configs/configs/smoke-rdma-gcm-slurm.json | New SLURM smoke config enabling RDMA + phase-1 GCM. |
| examples/slurm-configs/README.md | Documents the new cluster smoke config and runner usage for SLURM. |
| examples/k8s-configs/configs/smoke-rdma-k8s.json | New Kubernetes smoke config enabling RDMA. |
| examples/k8s-configs/README.md | Documents the new cluster smoke config and runner usage for Kubernetes. |
| examples/run-smoke.sh | Adds a wrapper script to run/verify SLURM and Kubernetes smoke targets via Makefile.smoke. |
| examples/cluster-smoke-checklist.md | Adds a step-by-step checklist for validating RDMA+GCM smoke artifacts. |
| examples/Makefile.smoke | Adds make targets to run and verify smoke workflows for SLURM and Kubernetes. |
| docs/deployment.md | Documents cluster feature stage placement and points to smoke assets. |
| docs/configuration.md | Documents the additional_context.cluster schema and behavior (RDMA + SLURM-only GCM). |
| README.md | Adds a “Smoke Testing” section linking to the new smoke assets and workflows. |
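For intuition about the recommender pattern listed above (probe the node, write a JSON artifact plus an env file), here is a rough sketch; the function name, probing logic, and artifact fields are illustrative assumptions, not the real rdma_recommender API:

```python
import json
from pathlib import Path


def recommend_rdma_env(output: Path, env_file: Path) -> dict:
    """Probe for RDMA devices (stubbed) and persist recommendations.

    A real implementation would inspect /sys/class/infiniband, ibstat
    output, NIC topology, etc., and choose transport-specific env vars.
    """
    ib = Path("/sys/class/infiniband")
    devices = sorted(p.name for p in ib.glob("*")) if ib.exists() else []
    if devices:
        env = {"NCCL_IB_HCA": ",".join(devices)}
    else:
        env = {"NCCL_IB_DISABLE": "1"}  # no RDMA HCAs found; fall back
    artifact = {"devices": devices, "recommended_env": env}
    output.write_text(json.dumps(artifact, indent=2))          # JSON artifact
    env_file.write_text("".join(f"{k}={v}\n" for k, v in env.items()))  # env file
    return artifact
```

The env file uses simple KEY=VALUE lines so a job script can source or parse it before launching the workload.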
Comments suppressed due to low confidence (1)

src/madengine/cli/validators.py:155

  • deploy is treated as a nested object here, but the rest of the codebase (and examples) use deploy as a string target (e.g. "slurm"/"k8s"). This structure check will reject valid configs; adjust validation so deploy is validated as a string (or remove it from the nested-object list and validate separately).
```python
        "deployment_config",
        "deploy",
    ):
        if nest in context and not isinstance(context[nest], dict):
            _fail_structure(nest, "an object")
```
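One way the suggested fix could look (a sketch only; the minimal _fail_structure and validate_context wrapper here are stand-ins so the example is self-contained, not the validator's real shape):

```python
def _fail_structure(key: str, expected: str):
    # Stand-in for the validator's real error helper.
    raise ValueError(f"additional_context.{key} must be {expected}")


def validate_context(context: dict) -> None:
    # Nested configuration sections must be objects...
    for nest in ("deployment_config",):
        if nest in context and not isinstance(context[nest], dict):
            _fail_structure(nest, "an object")
    # ...while deploy is a plain string target such as "slurm" or "k8s".
    if "deploy" in context and not isinstance(context["deploy"], str):
        _fail_structure("deploy", "a string")
```

With this split, configs like `{"deploy": "slurm"}` pass instead of being rejected by the nested-object check.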


Comment on lines 50 to +53
```python
DEPLOYMENT_TYPE = "slurm"
REQUIRED_TOOLS = ["sbatch", "squeue", "scontrol"]  # Must be available locally
GCM_ALLOWED_HEALTH_CHECKS = {"check-hca", "check-ibstat"}
GCM_ALLOWED_COLLECTORS = {"slurm_job_monitor"}
```

Copilot AI Apr 26, 2026


REQUIRED_TOOLS is used by validate() to check local SLURM tooling, but validate() also calls sinfo, and the new GCM preflight uses srun. Consider adding sinfo (and srun when cluster.gcm.enabled is true) to the required tool checks so failures are surfaced consistently and earlier.
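A sketch of how the tool list could be made conditional (constant names follow the snippet above; the helper functions are illustrative):

```python
import shutil

BASE_TOOLS = ["sbatch", "squeue", "scontrol", "sinfo"]  # sinfo: used by validate()


def required_tools(gcm_enabled: bool) -> list:
    """Return the local SLURM binaries validate() should check for."""
    # srun is only needed when the GCM preflight will run health checks.
    return BASE_TOOLS + (["srun"] if gcm_enabled else [])


def missing_tools(gcm_enabled: bool) -> list:
    """Which required binaries are absent from PATH on this machine."""
    return [t for t in required_tools(gcm_enabled) if shutil.which(t) is None]
```

Surfacing missing binaries in one upfront check keeps failures consistent instead of erroring later mid-run.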


```python
collector_cmd = collector_cfg.get("command", "slurm_job_monitor")
timeout_sec = int(collector_cfg.get("timeout_sec", 120))
max_retries = int(collector_cfg.get("max_retries", 1))
```

Copilot AI Apr 26, 2026


max_retries can be configured to 0/negative, which makes the while attempts < max_retries loop skip entirely; in that case the function returns failed_best_effort without ever writing the collector output file. Clamp max_retries to at least 1 (or explicitly handle 0 by writing a skipped/error artifact).

Suggested change

```diff
-max_retries = int(collector_cfg.get("max_retries", 1))
+max_retries = max(1, int(collector_cfg.get("max_retries", 1)))
```
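A minimal illustration of why the clamp matters (run_collector is a hypothetical stand-in whose retry loop mirrors the described `while attempts < max_retries` shape):

```python
def run_collector(cmd, max_retries: int):
    """Attempt cmd() up to max_retries times, returning its result or a sentinel."""
    max_retries = max(1, int(max_retries))  # 0/negative would otherwise skip the loop
    attempts = 0
    while attempts < max_retries:
        attempts += 1
        try:
            return cmd()
        except RuntimeError:
            continue  # best-effort: retry until attempts are exhausted
    return "failed_best_effort"


run_collector(lambda: "snapshot", max_retries=0)  # still runs once after clamping
```

Without the clamp, `max_retries=0` returns the sentinel without ever invoking the collector or writing its output file.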

Comment on lines +309 to +333
```bash
# Native RDMA recommender (submission node / single-node path)
echo ""
echo "Running RDMA recommender..."
RDMA_ARTIFACT="${WORKSPACE}/{{ rdma_artifact_name }}"
RDMA_ENV_FILE="${WORKSPACE}/rdma_env.list"
if [ "{{ rdma_strict }}" = "True" ]; then
    python3 -m madengine.deployment.rdma_recommender \
        --output "${RDMA_ARTIFACT}" \
        --env-file "${RDMA_ENV_FILE}" \
        --strict
else
    python3 -m madengine.deployment.rdma_recommender \
        --output "${RDMA_ARTIFACT}" \
        --env-file "${RDMA_ENV_FILE}"
fi
RDMA_EXIT=$?
if [ ${RDMA_EXIT} -ne 0 ]; then
    echo "RDMA recommender failed with exit code ${RDMA_EXIT}"
    {% if rdma_strict %}
    echo "RDMA strict mode enabled; stopping job."
    exit ${RDMA_EXIT}
    {% else %}
    echo "Continuing (strict=false)."
    {% endif %}
fi
```

Copilot AI Apr 26, 2026


For single-node runs (nodes == 1), the RDMA recommender is invoked here and then again later in the single-node execution path (another {% if rdma_enabled %} block). This duplicates work and can overwrite artifacts/env decisions; keep a single invocation for the single-node flow (and update the comment about “submission node” accordingly).

Suggested change

```diff
-# Native RDMA recommender (submission node / single-node path)
-echo ""
-echo "Running RDMA recommender..."
-RDMA_ARTIFACT="${WORKSPACE}/{{ rdma_artifact_name }}"
-RDMA_ENV_FILE="${WORKSPACE}/rdma_env.list"
-if [ "{{ rdma_strict }}" = "True" ]; then
-python3 -m madengine.deployment.rdma_recommender \
-    --output "${RDMA_ARTIFACT}" \
-    --env-file "${RDMA_ENV_FILE}" \
-    --strict
-else
-python3 -m madengine.deployment.rdma_recommender \
-    --output "${RDMA_ARTIFACT}" \
-    --env-file "${RDMA_ENV_FILE}"
-fi
-RDMA_EXIT=$?
-if [ ${RDMA_EXIT} -ne 0 ]; then
-echo "RDMA recommender failed with exit code ${RDMA_EXIT}"
-{% if rdma_strict %}
-echo "RDMA strict mode enabled; stopping job."
-exit ${RDMA_EXIT}
-{% else %}
-echo "Continuing (strict=false)."
-{% endif %}
-fi
+# Single-node RDMA recommender is invoked later in the execution path.
+# Do not run it here to avoid duplicate work and overwriting RDMA artifacts/env.
+{% endif %}
```

## 0) Set shared variables

```bash
cd /home/ysha/amd/madengine
```

Copilot AI Apr 26, 2026


This checklist hard-codes a developer-specific path (/home/ysha/amd/madengine). Use a generic instruction (e.g., cd <repo-root>) so the doc is usable by others and in CI environments.

Suggested change

```diff
-cd /home/ysha/amd/madengine
+cd <repo-root>
```

Comment on lines +207 to +211
```python
    if "cluster" in context:
        cluster = context["cluster"]
        if not isinstance(cluster, dict):
            _fail_structure("cluster", "an object")
```


Copilot AI Apr 26, 2026


The new cluster.* validation logic doesn’t appear to have any unit test coverage (no references in tests/), so regressions in schema/allowlist behavior may go unnoticed. Add validator tests covering both accepted and rejected cluster.rdma/cluster.gcm configurations (including allowlist enforcement and mode enum).

```python
            if int_key in collector and not isinstance(collector[int_key], int):
                _fail_structure(
                    f"cluster.gcm.collector.{int_key}", "an integer"
                )
```

Copilot AI Apr 26, 2026


cluster.gcm.collector.timeout_sec and max_retries are only type-checked. Values like max_retries=0 or negative numbers will lead to unexpected runtime behavior (e.g., collector loop never runs). Validate sensible minimums (e.g., timeout_sec >= 1, max_retries >= 1).

Suggested change

```diff
                 )
+        if "timeout_sec" in collector and collector["timeout_sec"] < 1:
+            _fail_value(
+                "cluster.gcm.collector.timeout_sec", "must be >= 1"
+            )
+        if "max_retries" in collector and collector["max_retries"] < 1:
+            _fail_value(
+                "cluster.gcm.collector.max_retries", "must be >= 1"
+            )
```
