refactor(gpu-arch): auto-detect MAD_SYSTEM_GPU_ARCHITECTURE for local full-run mode #113
Conversation
refactor(gpu-arch): auto-detect MAD_SYSTEM_GPU_ARCHITECTURE for local full-run mode

In the full workflow (`madengine run --tags`), the build phase ran with `build_only_mode=True` and skipped GPU detection, leaving Dockerfiles that declare `ARG MAD_SYSTEM_GPU_ARCHITECTURE` without a default built with an empty value. Users had to manually pass `--additional-context` every time.

- `Context.__init__`: add `detect_local_gpu_arch` param (default `False`), thread it to `init_build_context()`
- `Context.init_build_context`: add `detect_gpu_arch` param; when `True`, reuse `detect_gpu_vendor()` + `get_gpu_tool_manager()` + `normalize_architecture_name()` to detect and inject `MAD_SYSTEM_GPU_ARCHITECTURE` into `docker_build_arg` before the build; a user-provided value is never overridden; fails gracefully
- `BuildOrchestrator.__init__`: accept and forward `detect_local_gpu_arch` to `Context`; add a resolved-arch confirmation print in `execute()`
- `RunOrchestrator._build_phase`: pass `detect_local_gpu_arch=True` so only the full-workflow build path auto-detects; standalone `madengine build` is unaffected (flag defaults to `False`)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Pull request overview
This PR enables auto-detection of MAD_SYSTEM_GPU_ARCHITECTURE during the build phase of the full madengine run --tags ... workflow so Dockerfiles that require this build-arg (without a default) can build successfully without forcing users to pass it via --additional-context.
Changes:
- Enable GPU-arch auto-detection in `RunOrchestrator._build_phase()` by passing a new `detect_local_gpu_arch=True` flag into `BuildOrchestrator`.
- Add `detect_local_gpu_arch` plumbing through `BuildOrchestrator` into `Context` and implement optional build-only GPU arch injection in `Context.init_build_context()`.
- Print a "resolved" message when `MAD_SYSTEM_GPU_ARCHITECTURE` is present in `docker_build_arg` before building.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/madengine/orchestration/run_orchestrator.py` | Turns on build-time GPU arch auto-detection for full build+run workflows by passing `detect_local_gpu_arch=True`. |
| `src/madengine/orchestration/build_orchestrator.py` | Adds the `detect_local_gpu_arch` parameter, forwards it into `Context`, and prints when the arch is resolved. |
| `src/madengine/core/context.py` | Implements optional build-only detection of GPU arch and injects it into `ctx["docker_build_arg"]["MAD_SYSTEM_GPU_ARCHITECTURE"]`. |
```python
context_string = repr(self.additional_context)
self.context = Context(
    additional_context=context_string,
    build_only_mode=True,
    detect_local_gpu_arch=detect_local_gpu_arch,
)
```
`BuildOrchestrator` now relies on `Context(detect_local_gpu_arch=...)` to auto-detect `MAD_SYSTEM_GPU_ARCHITECTURE`, but the `Context` is instantiated without the CLI-provided `--rocm-path`. On systems where ROCm is installed outside the default (`/opt/rocm`) and the user supplies `--rocm-path`, auto-detection may fail and the build arg remains unresolved. Pass `rocm_path=get_rocm_path(getattr(args, "rocm_path", None))` (or equivalent) into `Context` here so build-time detection uses the same ROCm path as runtime.
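A minimal sketch of the precedence the reviewer is suggesting. The `resolve_rocm_path` function here is a hypothetical stand-in for madengine's actual `get_rocm_path` helper, assuming the usual ordering of explicit CLI flag, then the `ROCM_PATH` environment variable, then the `/opt/rocm` default:

```python
import os

def resolve_rocm_path(cli_value=None):
    """Hypothetical stand-in for get_rocm_path(): an explicit CLI flag wins,
    then the ROCM_PATH environment variable, then the conventional default."""
    if cli_value:
        return cli_value
    return os.environ.get("ROCM_PATH", "/opt/rocm")
```

Threading the resolved value into `Context(rocm_path=...)` at construction time would keep build-time detection consistent with whatever path the runtime uses.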
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
```python
# detect_local_gpu_arch=True: full workflow on a local single node — auto-detect
# MAD_SYSTEM_GPU_ARCHITECTURE before the build so Dockerfiles that require it
# (ARG MAD_SYSTEM_GPU_ARCHITECTURE with no default) are built correctly without
# requiring the user to manually pass --additional-context.
# The user's explicitly provided value (if any) is still respected and not overridden.
build_orch = BuildOrchestrator(
    self.args,
    self.additional_context,
    detect_local_gpu_arch=True,
)
```
`detect_local_gpu_arch=True` is passed unconditionally for every tag-triggered build (including when the user intends a Slurm/K8s deployment). This can auto-inject the coordinator node's `MAD_SYSTEM_GPU_ARCHITECTURE` into `docker_build_arg`, which may be incorrect for distributed runs where the build node's GPU (or lack of GPU) differs from the execution nodes. Consider enabling auto-detect only when the inferred target is local/docker (e.g., based on `_infer_deployment_target(self.additional_context)` or an explicit CLI target), and leave it disabled for slurm/k8s unless the user explicitly requests it.
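A sketch of the gate this comment proposes, under the assumption that a deployment target can be read out of the additional context. The `deployment_target` key and the target names here are illustrative, not madengine's real schema:

```python
def should_autodetect_gpu_arch(additional_context):
    """Illustrative gate: only local/docker builds run on the same node that
    will execute the workload, so only there is the build host's GPU arch
    trustworthy; distributed targets build on a coordinator whose GPU
    (or lack of one) may differ from the execution nodes."""
    target = (additional_context or {}).get("deployment_target", "docker")
    return target in ("local", "docker")
```

The orchestrator could then pass the gate's result into `BuildOrchestrator(..., detect_local_gpu_arch=should_autodetect_gpu_arch(self.additional_context))` instead of a hard-coded `True`.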
```python
# Optionally auto-detect GPU architecture for local full-workflow builds (build+run).
# Skipped for standalone `madengine build` on non-GPU/CI nodes (detect_gpu_arch=False).
if detect_gpu_arch and "MAD_SYSTEM_GPU_ARCHITECTURE" not in self.ctx.get("docker_build_arg", {}):
    try:
        from madengine.utils.gpu_validator import detect_gpu_vendor
        from madengine.execution.dockerfile_utils import normalize_architecture_name

        vendor = detect_gpu_vendor(self._rocm_path)
        if vendor in (GPUVendor.AMD, GPUVendor.NVIDIA):
            manager = get_gpu_tool_manager(vendor, self._rocm_path)
            raw_arch = manager.get_gpu_architecture()
            arch = normalize_architecture_name(raw_arch) or raw_arch.strip()
            self.ctx["docker_build_arg"]["MAD_SYSTEM_GPU_ARCHITECTURE"] = arch
            print(f"Auto-detected GPU architecture for build: {arch}")
        else:
            print("Warning: No supported GPU detected; MAD_SYSTEM_GPU_ARCHITECTURE will not be set automatically.")
            print("Consider providing it via --additional-context if needed for build args.")
    except Exception as e:
        print(f"Warning: Could not auto-detect GPU architecture for build: {e}")
        print("Consider providing MAD_SYSTEM_GPU_ARCHITECTURE via --additional-context if needed for build args.")
```
The new build-only auto-detection path (`detect_gpu_arch=True`) is currently untested. Since `Context` already has unit coverage, please add a unit test that patches `detect_gpu_vendor()` / `get_gpu_tool_manager().get_gpu_architecture()` and asserts that `ctx["docker_build_arg"]["MAD_SYSTEM_GPU_ARCHITECTURE"]` is injected only when absent, and that failures are handled without raising in build-only mode.
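A self-contained sketch of the assertions such a test could make. The `inject_gpu_arch` helper below is a simplified stand-in for the injection logic in `Context.init_build_context()`, with the detection chain passed in as plain callables rather than patched imports, so the three required behaviors can be checked in isolation:

```python
def inject_gpu_arch(ctx, detect_gpu_arch, detect_vendor, get_arch):
    """Simplified stand-in for the build-only injection in Context.init_build_context()."""
    build_args = ctx.setdefault("docker_build_arg", {})
    if not detect_gpu_arch or "MAD_SYSTEM_GPU_ARCHITECTURE" in build_args:
        return
    try:
        vendor = detect_vendor()
        if vendor in ("AMD", "NVIDIA"):
            build_args["MAD_SYSTEM_GPU_ARCHITECTURE"] = get_arch()
    except Exception:
        pass  # build-only mode: fail gracefully, never raise

def _raise(exc):
    raise exc

# 1. Arch is injected when absent.
ctx = {}
inject_gpu_arch(ctx, True, lambda: "AMD", lambda: "gfx942")
assert ctx["docker_build_arg"]["MAD_SYSTEM_GPU_ARCHITECTURE"] == "gfx942"

# 2. A user-provided value is never overridden.
ctx = {"docker_build_arg": {"MAD_SYSTEM_GPU_ARCHITECTURE": "gfx90a"}}
inject_gpu_arch(ctx, True, lambda: _raise(AssertionError("should not detect")), lambda: "gfx942")
assert ctx["docker_build_arg"]["MAD_SYSTEM_GPU_ARCHITECTURE"] == "gfx90a"

# 3. Detection failure does not raise and leaves the build arg unset.
ctx = {}
inject_gpu_arch(ctx, True, lambda: _raise(RuntimeError("no GPU")), lambda: "unused")
assert "MAD_SYSTEM_GPU_ARCHITECTURE" not in ctx["docker_build_arg"]
```

A real test in the suite would patch the imports inside `Context.init_build_context()` (e.g. with `unittest.mock.patch` or pytest's `monkeypatch`) rather than use injected callables, but the assertions would be the same three.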
Summary
In the full workflow (`madengine run --tags`), the build phase previously ran with `build_only_mode=True` and skipped GPU detection entirely, so Dockerfiles with `ARG MAD_SYSTEM_GPU_ARCHITECTURE` (no default) were built with an empty value. Users were forced to manually supply `--additional-context '{"docker_build_arg": {"MAD_SYSTEM_GPU_ARCHITECTURE": "gfx942"}}'` on every local run.

GPU arch auto-detection — design

Detection fires only in the `RunOrchestrator._build_phase()` path (full workflow). The flag is `False` by default so standalone `madengine build` is unaffected.

Detection chain reuses existing modules — no new code:

- `detect_gpu_vendor()` (`utils/gpu_validator.py`) — fast, filesystem-only check, no subprocess
- `get_gpu_tool_manager(vendor, rocm_path)` (`utils/gpu_tool_factory.py`) — singleton factory
- `manager.get_gpu_architecture()` (`utils/rocm_tool_manager.py` / `nvidia_tool_manager.py`) — cached, with actionable error messages
- `normalize_architecture_name()` (`execution/dockerfile_utils.py`) — consistent format

A user-provided value in `--additional-context` is always respected and never overridden. Fails gracefully with a warning on nodes without a GPU.

Test plan

- `madengine run --tags <tag>` on a local GPU node — build phase prints `Auto-detected GPU architecture for build: gfx942`, no "unresolved" warning, image builds correctly
- `madengine run --tags <tag> --additional-context '{"docker_build_arg": {"MAD_SYSTEM_GPU_ARCHITECTURE": "gfx90a"}}'` — user value respected, auto-detect skipped
- `madengine build --tags <tag>` on a non-GPU CI node — same behaviour as before (warning if Dockerfile needs arch, no crash)
- `madengine run --manifest-file build_manifest.json` — run-only path unchanged
- `pytest tests/unit/ tests/e2e/`