refactor(gpu-arch): auto-detect MAD_SYSTEM_GPU_ARCHITECTURE for local full-run mode #113
Conversation
refactor(gpu-arch): auto-detect MAD_SYSTEM_GPU_ARCHITECTURE for local full-run mode

In the full workflow (`madengine run --tags`), the build phase ran with `build_only_mode=True` and skipped GPU detection, leaving Dockerfiles that declare `ARG MAD_SYSTEM_GPU_ARCHITECTURE` without a default built with an empty value. Users had to manually pass `--additional-context` every time.

- `Context.__init__`: add `detect_local_gpu_arch` param (default `False`), thread it to `init_build_context()`
- `Context.init_build_context`: add `detect_gpu_arch` param; when `True`, reuse `detect_gpu_vendor()` + `get_gpu_tool_manager()` + `normalize_architecture_name()` to detect and inject `MAD_SYSTEM_GPU_ARCHITECTURE` into `docker_build_arg` before the build; a user-provided value is never overridden; fails gracefully
- `BuildOrchestrator.__init__`: accept and forward `detect_local_gpu_arch` to `Context`; add a resolved-arch confirmation print in `execute()`
- `RunOrchestrator._build_phase`: pass `detect_local_gpu_arch=True` so only the full-workflow build path auto-detects; standalone `madengine build` is unaffected (flag defaults to `False`)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Pull request overview
This PR enables auto-detection of MAD_SYSTEM_GPU_ARCHITECTURE during the build phase of the full madengine run --tags ... workflow so Dockerfiles that require this build-arg (without a default) can build successfully without forcing users to pass it via --additional-context.
Changes:
- Enable GPU-arch auto-detection in `RunOrchestrator._build_phase()` by passing a new `detect_local_gpu_arch=True` flag into `BuildOrchestrator`.
- Add `detect_local_gpu_arch` plumbing through `BuildOrchestrator` into `Context` and implement optional build-only GPU arch injection in `Context.init_build_context()`.
- Print a "resolved" message when `MAD_SYSTEM_GPU_ARCHITECTURE` is present in `docker_build_arg` before building.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/madengine/orchestration/run_orchestrator.py` | Turns on build-time GPU arch auto-detection for full build+run workflows by passing `detect_local_gpu_arch=True`. |
| `src/madengine/orchestration/build_orchestrator.py` | Adds the `detect_local_gpu_arch` parameter, forwards it into `Context`, and prints when the arch is resolved. |
| `src/madengine/core/context.py` | Implements optional build-only detection of GPU arch and injects it into `ctx["docker_build_arg"]["MAD_SYSTEM_GPU_ARCHITECTURE"]`. |
```python
context_string = repr(self.additional_context)
self.context = Context(
    additional_context=context_string,
    build_only_mode=True,
    detect_local_gpu_arch=detect_local_gpu_arch,
)
```
`BuildOrchestrator` now relies on `Context(detect_local_gpu_arch=...)` to auto-detect `MAD_SYSTEM_GPU_ARCHITECTURE`, but the `Context` is instantiated without the CLI-provided `--rocm-path`. On systems where ROCm is installed outside the default (`/opt/rocm`) and the user supplies `--rocm-path`, auto-detection may fail and the build arg remains unresolved. Pass `rocm_path=get_rocm_path(getattr(args, "rocm_path", None))` (or equivalent) into `Context` here so build-time detection uses the same ROCm path as runtime.
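A minimal sketch of the precedence the reviewer is suggesting. The `resolve_rocm_path` function here is a hypothetical stand-in for madengine's actual `get_rocm_path` helper, assuming the usual ordering of explicit CLI flag, then the `ROCM_PATH` environment variable, then the `/opt/rocm` default:

```python
import os

def resolve_rocm_path(cli_value=None):
    """Hypothetical stand-in for get_rocm_path(): an explicit CLI flag wins,
    then the ROCM_PATH environment variable, then the conventional default."""
    if cli_value:
        return cli_value
    return os.environ.get("ROCM_PATH", "/opt/rocm")
```

Threading the resolved value into `Context(rocm_path=...)` at construction time would keep build-time detection consistent with whatever path the runtime uses.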
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
```python
# detect_local_gpu_arch=True: full workflow on a local single node — auto-detect
# MAD_SYSTEM_GPU_ARCHITECTURE before the build so Dockerfiles that require it
# (ARG MAD_SYSTEM_GPU_ARCHITECTURE with no default) are built correctly without
# requiring the user to manually pass --additional-context.
# The user's explicitly provided value (if any) is still respected and not overridden.
build_orch = BuildOrchestrator(
    self.args,
    self.additional_context,
    detect_local_gpu_arch=True,
)
```
`detect_local_gpu_arch=True` is passed unconditionally for every tag-triggered build (including when the user intends a Slurm/K8s deployment). This can auto-inject the coordinator node's `MAD_SYSTEM_GPU_ARCHITECTURE` into `docker_build_arg`, which may be incorrect for distributed runs where the build node's GPU (or lack of GPU) differs from the execution nodes. Consider enabling auto-detect only when the inferred target is local/docker (e.g., based on `_infer_deployment_target(self.additional_context)` or an explicit CLI target), and leave it disabled for slurm/k8s unless the user explicitly requests it.
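A sketch of the gate this comment proposes, under the assumption that a deployment target can be read out of the additional context. The `deployment_target` key and the target names here are illustrative, not madengine's real schema:

```python
def should_autodetect_gpu_arch(additional_context):
    """Illustrative gate: only local/docker builds run on the same node that
    will execute the workload, so only there is the build host's GPU arch
    trustworthy; distributed targets build on a coordinator whose GPU
    (or lack of one) may differ from the execution nodes."""
    target = (additional_context or {}).get("deployment_target", "docker")
    return target in ("local", "docker")
```

The orchestrator could then pass the gate's result into `BuildOrchestrator(..., detect_local_gpu_arch=should_autodetect_gpu_arch(self.additional_context))` instead of a hard-coded `True`.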
```python
# Optionally auto-detect GPU architecture for local full-workflow builds (build+run).
# Skipped for standalone `madengine build` on non-GPU/CI nodes (detect_gpu_arch=False).
if detect_gpu_arch and "MAD_SYSTEM_GPU_ARCHITECTURE" not in self.ctx.get("docker_build_arg", {}):
    try:
        from madengine.utils.gpu_validator import detect_gpu_vendor
        from madengine.execution.dockerfile_utils import normalize_architecture_name

        vendor = detect_gpu_vendor(self._rocm_path)
        if vendor in (GPUVendor.AMD, GPUVendor.NVIDIA):
            manager = get_gpu_tool_manager(vendor, self._rocm_path)
            raw_arch = manager.get_gpu_architecture()
            arch = normalize_architecture_name(raw_arch) or raw_arch.strip()
            self.ctx["docker_build_arg"]["MAD_SYSTEM_GPU_ARCHITECTURE"] = arch
            print(f"Auto-detected GPU architecture for build: {arch}")
        else:
            print("Warning: No supported GPU detected; MAD_SYSTEM_GPU_ARCHITECTURE will not be set automatically.")
            print("Consider providing it via --additional-context if needed for build args.")
    except Exception as e:
        print(f"Warning: Could not auto-detect GPU architecture for build: {e}")
        print("Consider providing MAD_SYSTEM_GPU_ARCHITECTURE via --additional-context if needed for build args.")
```
The new build-only auto-detection path (`detect_gpu_arch=True`) is currently untested. Since `Context` already has unit coverage, please add a unit test that patches `detect_gpu_vendor()` / `get_gpu_tool_manager().get_gpu_architecture()` and asserts that `ctx["docker_build_arg"]["MAD_SYSTEM_GPU_ARCHITECTURE"]` is injected only when absent, and that failures are handled without raising in build-only mode.
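A self-contained sketch of the assertions such a test could make. The `inject_gpu_arch` helper below is a simplified stand-in for the injection logic in `Context.init_build_context()`, with the detection chain passed in as plain callables rather than patched imports, so the three required behaviors can be checked in isolation:

```python
def inject_gpu_arch(ctx, detect_gpu_arch, detect_vendor, get_arch):
    """Simplified stand-in for the build-only injection in Context.init_build_context()."""
    build_args = ctx.setdefault("docker_build_arg", {})
    if not detect_gpu_arch or "MAD_SYSTEM_GPU_ARCHITECTURE" in build_args:
        return
    try:
        vendor = detect_vendor()
        if vendor in ("AMD", "NVIDIA"):
            build_args["MAD_SYSTEM_GPU_ARCHITECTURE"] = get_arch()
    except Exception:
        pass  # build-only mode: fail gracefully, never raise

def _raise(exc):
    raise exc

# 1. Arch is injected when absent.
ctx = {}
inject_gpu_arch(ctx, True, lambda: "AMD", lambda: "gfx942")
assert ctx["docker_build_arg"]["MAD_SYSTEM_GPU_ARCHITECTURE"] == "gfx942"

# 2. A user-provided value is never overridden.
ctx = {"docker_build_arg": {"MAD_SYSTEM_GPU_ARCHITECTURE": "gfx90a"}}
inject_gpu_arch(ctx, True, lambda: _raise(AssertionError("should not detect")), lambda: "gfx942")
assert ctx["docker_build_arg"]["MAD_SYSTEM_GPU_ARCHITECTURE"] == "gfx90a"

# 3. Detection failure does not raise and leaves the build arg unset.
ctx = {}
inject_gpu_arch(ctx, True, lambda: _raise(RuntimeError("no GPU")), lambda: "unused")
assert "MAD_SYSTEM_GPU_ARCHITECTURE" not in ctx["docker_build_arg"]
```

A real test in the suite would patch the imports inside `Context.init_build_context()` (e.g. with `unittest.mock.patch` or pytest's `monkeypatch`) rather than use injected callables, but the assertions would be the same three.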
Summary
In the full workflow (`madengine run --tags`), the build phase previously ran with `build_only_mode=True` and skipped GPU detection entirely, so Dockerfiles with `ARG MAD_SYSTEM_GPU_ARCHITECTURE` (no default) were built with an empty value. Users were forced to manually supply `--additional-context '{"docker_build_arg": {"MAD_SYSTEM_GPU_ARCHITECTURE": "gfx942"}}'` on every local run.

GPU arch auto-detection — design

Detection fires only in the `RunOrchestrator._build_phase()` path (full workflow). The flag is `False` by default so standalone `madengine build` is unaffected.

Detection chain reuses existing modules — no new code:

- `detect_gpu_vendor()` (`utils/gpu_validator.py`) — fast, filesystem-only check, no subprocess
- `get_gpu_tool_manager(vendor, rocm_path)` (`utils/gpu_tool_factory.py`) — singleton factory
- `manager.get_gpu_architecture()` (`utils/rocm_tool_manager.py` / `nvidia_tool_manager.py`) — cached, with actionable error messages
- `normalize_architecture_name()` (`execution/dockerfile_utils.py`) — consistent format

A user-provided value in `--additional-context` is always respected and never overridden. Fails gracefully with a warning on nodes without a GPU.

Test plan

- `madengine run --tags <tag>` on a local GPU node — build phase prints `Auto-detected GPU architecture for build: gfx942`, no "unresolved" warning, image builds correctly
- `madengine run --tags <tag> --additional-context '{"docker_build_arg": {"MAD_SYSTEM_GPU_ARCHITECTURE": "gfx90a"}}'` — user value respected, auto-detect skipped
- `madengine build --tags <tag>` on a non-GPU CI node — same behaviour as before (warning if Dockerfile needs arch, no crash)
- `madengine run --manifest-file build_manifest.json` — run-only path unchanged
- `pytest tests/unit/ tests/e2e/`