
refactor(gpu-arch): auto-detect MAD_SYSTEM_GPU_ARCHITECTURE for local full-run mode#113

Open
coketaste wants to merge 1 commit into develop from coketaste/mad-system-gpu-arch
Conversation


@coketaste coketaste commented Apr 24, 2026

Summary

  • Auto-detect GPU arch before build in full workflow (madengine run --tags): the build phase previously ran with build_only_mode=True and skipped GPU detection entirely, so
    Dockerfiles with ARG MAD_SYSTEM_GPU_ARCHITECTURE (no default) were built with an empty value. Users were forced to manually supply --additional-context '{"docker_build_arg": {"MAD_SYSTEM_GPU_ARCHITECTURE": "gfx942"}}' on every local run.

GPU arch auto-detection — design

Detection fires only in the RunOrchestrator._build_phase() path (full workflow). The flag is False by default so standalone madengine build is unaffected.

Detection chain reuses existing modules — no new code:

  1. detect_gpu_vendor() (utils/gpu_validator.py) — fast, filesystem-only check, no subprocess
  2. get_gpu_tool_manager(vendor, rocm_path) (utils/gpu_tool_factory.py) — singleton factory
  3. manager.get_gpu_architecture() (utils/rocm_tool_manager.py / nvidia_tool_manager.py) — cached, with actionable error messages
  4. normalize_architecture_name() (execution/dockerfile_utils.py) — consistent format

User-provided value in --additional-context is always respected and never overridden. Fails gracefully with a warning on nodes without a GPU.
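The detection-and-inject flow described above can be sketched as a small pure function. The helper name `resolve_build_gpu_arch` and the dependency-injection style are illustrative, not the actual madengine code; the chain (vendor detection, tool-manager factory, architecture query, name normalization) follows the four steps listed above.

```python
def resolve_build_gpu_arch(docker_build_arg, detect_vendor, get_manager, normalize):
    """Inject MAD_SYSTEM_GPU_ARCHITECTURE into the build args unless the
    user already provided one. Never raises: build-only mode must survive
    nodes without a GPU."""
    if "MAD_SYSTEM_GPU_ARCHITECTURE" in docker_build_arg:
        return docker_build_arg  # user-provided value always wins
    try:
        vendor = detect_vendor()                      # step 1: filesystem-only check
        manager = get_manager(vendor)                 # step 2: singleton factory
        raw_arch = manager.get_gpu_architecture()     # step 3: cached query
        arch = normalize(raw_arch) or raw_arch.strip()  # step 4: consistent format
        docker_build_arg["MAD_SYSTEM_GPU_ARCHITECTURE"] = arch
        print(f"Auto-detected GPU architecture for build: {arch}")
    except Exception as exc:
        print(f"Warning: could not auto-detect GPU architecture for build: {exc}")
    return docker_build_arg
```

The real implementation lives inside `Context.init_build_context()`; this sketch just isolates the ordering guarantee that a user-supplied value short-circuits detection.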

Test plan

  • madengine run --tags <tag> on a local GPU node — build phase prints Auto-detected GPU architecture for build: gfx942, no "unresolved" warning, image builds correctly
  • madengine run --tags <tag> --additional-context '{"docker_build_arg": {"MAD_SYSTEM_GPU_ARCHITECTURE": "gfx90a"}}' — user value respected, auto-detect skipped
  • madengine build --tags <tag> on a non-GPU CI node — same behaviour as before (warning if Dockerfile needs arch, no crash)
  • madengine run --manifest-file build_manifest.json — run-only path unchanged
  • Existing unit tests pass: pytest tests/unit/ tests/e2e/

refactor(gpu-arch): auto-detect MAD_SYSTEM_GPU_ARCHITECTURE for local full-run mode

In full workflow (madengine run --tags), the build phase ran with
build_only_mode=True and skipped GPU detection, leaving Dockerfiles that
declare ARG MAD_SYSTEM_GPU_ARCHITECTURE without a default built with an
empty value. Users had to manually pass --additional-context every time.

- Context.__init__: add detect_local_gpu_arch param (default False),
  thread it to init_build_context()
- Context.init_build_context: add detect_gpu_arch param; when True,
  reuse detect_gpu_vendor() + get_gpu_tool_manager() + normalize_architecture_name()
  to detect and inject MAD_SYSTEM_GPU_ARCHITECTURE into docker_build_arg
  before the build; user-provided value is never overridden; fails gracefully
- BuildOrchestrator.__init__: accept and forward detect_local_gpu_arch
  to Context; add resolved-arch confirmation print in execute()
- RunOrchestrator._build_phase: pass detect_local_gpu_arch=True so only
  the full-workflow build path auto-detects; standalone madengine build
  is unaffected (flag defaults to False)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
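The parameter threading described in the commit message can be sketched as follows. Class and parameter names (`Context`, `BuildOrchestrator`, `detect_local_gpu_arch`, `init_build_context`) come from the commit body; the method bodies are stubs for illustration, not the real madengine code.

```python
class Context:
    """Stub: shows only how the detect_local_gpu_arch flag is threaded."""

    def __init__(self, build_only_mode=False, detect_local_gpu_arch=False):
        self.build_only_mode = build_only_mode
        # Flag defaults to False, so existing callers are unaffected.
        self.init_build_context(detect_gpu_arch=detect_local_gpu_arch)

    def init_build_context(self, detect_gpu_arch=False):
        # In the real code this is where detection runs and injects
        # MAD_SYSTEM_GPU_ARCHITECTURE into docker_build_arg when enabled.
        self.gpu_arch_detection_enabled = detect_gpu_arch


class BuildOrchestrator:
    """Stub: accepts the flag and forwards it into Context."""

    def __init__(self, detect_local_gpu_arch=False):
        self.context = Context(
            build_only_mode=True,
            detect_local_gpu_arch=detect_local_gpu_arch,
        )


# Only the full-workflow path (RunOrchestrator._build_phase) opts in;
# standalone `madengine build` keeps the default of False.
full_run_build = BuildOrchestrator(detect_local_gpu_arch=True)
standalone_build = BuildOrchestrator()
```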
@coketaste coketaste self-assigned this Apr 24, 2026
Copilot AI review requested due to automatic review settings April 24, 2026 04:00

Copilot AI left a comment


Pull request overview

This PR enables auto-detection of MAD_SYSTEM_GPU_ARCHITECTURE during the build phase of the full madengine run --tags ... workflow so Dockerfiles that require this build-arg (without a default) can build successfully without forcing users to pass it via --additional-context.

Changes:

  • Enable GPU-arch auto-detection in RunOrchestrator._build_phase() by passing a new detect_local_gpu_arch=True flag into BuildOrchestrator.
  • Add detect_local_gpu_arch plumbing through BuildOrchestrator into Context and implement optional build-only GPU arch injection in Context.init_build_context().
  • Print a “resolved” message when MAD_SYSTEM_GPU_ARCHITECTURE is present in docker_build_arg before building.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Changed files

  • src/madengine/orchestration/run_orchestrator.py — Turns on build-time GPU arch auto-detection for full build+run workflows by passing detect_local_gpu_arch=True.
  • src/madengine/orchestration/build_orchestrator.py — Adds the detect_local_gpu_arch parameter, forwards it into Context, and prints when the arch is resolved.
  • src/madengine/core/context.py — Implements optional build-only detection of GPU arch and injects it into ctx["docker_build_arg"]["MAD_SYSTEM_GPU_ARCHITECTURE"].


Comment on lines 133 to 138
        context_string = repr(self.additional_context)
        self.context = Context(
            additional_context=context_string,
            build_only_mode=True,
            detect_local_gpu_arch=detect_local_gpu_arch,
        )

Copilot AI Apr 24, 2026


BuildOrchestrator now relies on Context(detect_local_gpu_arch=...) to auto-detect MAD_SYSTEM_GPU_ARCHITECTURE, but the Context is instantiated without passing the CLI-provided --rocm-path. On systems where ROCm is installed outside the default (/opt/rocm) and the user supplies --rocm-path, auto-detection may fail and the build arg remains unresolved. Pass rocm_path=get_rocm_path(getattr(args, "rocm_path", None)) (or equivalent) into Context here so build-time detection uses the same ROCm path as runtime.
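The reviewer's suggested fix can be sketched as a small resolution helper. `get_rocm_path` is named in the comment above, but this stand-in implementation (CLI flag, then `ROCM_PATH` environment variable, then the conventional `/opt/rocm` default) is an assumption about its behavior, not the actual madengine helper.

```python
import os


def get_rocm_path(cli_value=None):
    """Resolve the ROCm install path with the usual precedence:
    explicit CLI --rocm-path > ROCM_PATH env var > /opt/rocm default."""
    return cli_value or os.environ.get("ROCM_PATH") or "/opt/rocm"


# The fix the reviewer proposes: build-time Context construction would then
# receive the same path as runtime, e.g.
#   Context(..., rocm_path=get_rocm_path(getattr(args, "rocm_path", None)))
```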


Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.



Comment on lines +348 to +357
        # detect_local_gpu_arch=True: full workflow on a local single node — auto-detect
        # MAD_SYSTEM_GPU_ARCHITECTURE before the build so Dockerfiles that require it
        # (ARG MAD_SYSTEM_GPU_ARCHITECTURE with no default) are built correctly without
        # requiring the user to manually pass --additional-context.
        # The user's explicitly provided value (if any) is still respected and not overridden.
        build_orch = BuildOrchestrator(
            self.args,
            self.additional_context,
            detect_local_gpu_arch=True,
        )

Copilot AI Apr 24, 2026


detect_local_gpu_arch=True is passed unconditionally for every tag-triggered build (including when the user intends a Slurm/K8s deployment). This can auto-inject the coordinator node’s MAD_SYSTEM_GPU_ARCHITECTURE into docker_build_arg, which may be incorrect for distributed runs where the build node’s GPU (or lack of GPU) differs from the execution nodes. Consider enabling auto-detect only when the inferred target is local/docker (e.g., based on _infer_deployment_target(self.additional_context) or an explicit CLI target), and leave it disabled for slurm/k8s unless the user explicitly requests it.
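The gating the reviewer proposes could look like the sketch below. `_infer_deployment_target` is named in the comment above, but this stand-in (reading a key from `additional_context` and defaulting to local) is an assumption about its behavior, not madengine's actual inference logic.

```python
def _infer_deployment_target(additional_context):
    """Stand-in: assume the target is recorded under a 'deployment_target'
    key and that a local single-node run is the default."""
    return additional_context.get("deployment_target", "local")


def should_detect_local_gpu_arch(additional_context):
    """Enable build-time GPU arch auto-detection only for local/docker
    targets; for slurm/k8s the build node's GPU (or lack of one) may not
    match the execution nodes, so detection stays off unless the user
    explicitly opts in."""
    return _infer_deployment_target(additional_context) in ("local", "docker")
```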

Comment on lines +181 to +200
        # Optionally auto-detect GPU architecture for local full-workflow builds (build+run).
        # Skipped for standalone `madengine build` on non-GPU/CI nodes (detect_gpu_arch=False).
        if detect_gpu_arch and "MAD_SYSTEM_GPU_ARCHITECTURE" not in self.ctx.get("docker_build_arg", {}):
            try:
                from madengine.utils.gpu_validator import detect_gpu_vendor
                from madengine.execution.dockerfile_utils import normalize_architecture_name

                vendor = detect_gpu_vendor(self._rocm_path)
                if vendor in (GPUVendor.AMD, GPUVendor.NVIDIA):
                    manager = get_gpu_tool_manager(vendor, self._rocm_path)
                    raw_arch = manager.get_gpu_architecture()
                    arch = normalize_architecture_name(raw_arch) or raw_arch.strip()
                    self.ctx["docker_build_arg"]["MAD_SYSTEM_GPU_ARCHITECTURE"] = arch
                    print(f"Auto-detected GPU architecture for build: {arch}")
                else:
                    print("Warning: No supported GPU detected; MAD_SYSTEM_GPU_ARCHITECTURE will not be set automatically.")
                    print("Consider providing it via --additional-context if needed for build args.")
            except Exception as e:
                print(f"Warning: Could not auto-detect GPU architecture for build: {e}")
                print("Consider providing MAD_SYSTEM_GPU_ARCHITECTURE via --additional-context if needed for build args.")

Copilot AI Apr 24, 2026


The new build-only auto-detection path (detect_gpu_arch=True) is currently untested. Since Context already has unit coverage, please add a unit test that patches detect_gpu_vendor() / get_gpu_tool_manager().get_gpu_architecture() and asserts that ctx["docker_build_arg"]["MAD_SYSTEM_GPU_ARCHITECTURE"] is injected only when absent, and that failures are handled without raising in build-only mode.
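A test along the lines requested above might look like the following sketch. To keep it self-contained it exercises a local stand-in for the injection logic with `unittest.mock` collaborators rather than patching madengine's real import paths, which are assumed here and may differ.

```python
from unittest import mock


def inject_build_arch(ctx, detect_vendor, get_manager):
    """Stand-in for the injection inside Context.init_build_context():
    set the arch only when absent, and never raise in build-only mode."""
    args = ctx.setdefault("docker_build_arg", {})
    if "MAD_SYSTEM_GPU_ARCHITECTURE" in args:
        return
    try:
        manager = get_manager(detect_vendor())
        args["MAD_SYSTEM_GPU_ARCHITECTURE"] = manager.get_gpu_architecture()
    except Exception:
        pass  # non-GPU/CI nodes must not crash the build


def test_injected_only_when_absent():
    manager = mock.Mock()
    manager.get_gpu_architecture.return_value = "gfx942"

    ctx = {}
    inject_build_arch(ctx, lambda: "amd", lambda v: manager)
    assert ctx["docker_build_arg"]["MAD_SYSTEM_GPU_ARCHITECTURE"] == "gfx942"

    # A user-provided value must never be overridden.
    ctx = {"docker_build_arg": {"MAD_SYSTEM_GPU_ARCHITECTURE": "gfx90a"}}
    inject_build_arch(ctx, lambda: "amd", lambda v: manager)
    assert ctx["docker_build_arg"]["MAD_SYSTEM_GPU_ARCHITECTURE"] == "gfx90a"


def test_detection_failure_does_not_raise():
    def boom():
        raise RuntimeError("no GPU on this node")

    ctx = {}
    inject_build_arch(ctx, boom, lambda v: None)
    assert "MAD_SYSTEM_GPU_ARCHITECTURE" not in ctx.get("docker_build_arg", {})
```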

