Skip to content

feat: ROCm path resolution (auto-detect, MAD_ROCM_PATH, TheRock markers)#110

Open
coketaste wants to merge 11 commits intodevelopfrom
coketaste/v2-rocm-path
Open

feat: ROCm path resolution (auto-detect, MAD_ROCM_PATH, TheRock markers)#110
coketaste wants to merge 11 commits intodevelopfrom
coketaste/v2-rocm-path

Conversation

@coketaste
Copy link
Copy Markdown
Collaborator

Summary

Adds default ROCm install root resolution for the run phase: scan common layouts (including versioned /opt/rocm-* and TheRock markers), with clear overrides for host vs container. Aligns in-container ROCM_PATH with docker_env_vars when the image layout differs from the host.

Motivation

Users on TheRock- or CI-style images may not have a plain /opt/rocm tree; madengine should still find a valid ROCm root for validation and tool paths, while allowing explicit MAD_ROCM_PATH / --rocm-path when auto-detect is wrong or undesired.

What changed

  • New helpers: madengine.utils.rocm_path_resolver and madengine.utils.therock_markers (shared TheRock share/therock file markers).
  • Host precedence: top-level MAD_ROCM_PATH in additional context → --rocm-path → auto-detect (unless MAD_AUTO_ROCM_PATH=0) → ROCM_PATH/opt/rocm.
  • Container: docker_env_vars.MAD_ROCM_PATH sets in-container ROCM_PATH (key consumed); otherwise host-resolved path is mirrored unless the user set ROCM_PATH in docker_env_vars.
  • Wired through Context, run orchestrator, and CLI/validators; README, configuration, usage, and CHANGELOG updated.
  • therock_detector script: prepends the package parent to sys.path so imports work when run as a file; last-resort path fallback only if the script is outside the tree.
  • Unit tests (test_rocm_path, test_therock_markers) and integration GPU test adjustments.

How to test

  • pytest tests/unit/test_rocm_path.py tests/unit/test_therock_markers.py (or full tests/unit/).
  • Optional: madengine run --help and confirm --rocm-path / env behavior as documented.
  • Optional: on a TheRock or versioned-ROCm image, run a minimal madengine run without MAD_ROCM_PATH and confirm a sensible ROCM_PATH in the generated context.

Checklist (optional; trim for your team)

  • Docs / help text match intended precedence.
  • No behavior change for MAD_AUTO_ROCM_PATH=0 beyond documented legacy path.
  • CI / reviewers: link internal ticket (e.g. ROCM-21753) if required.

Add rocm_path_resolver and shared therock_markers; wire Context, run CLI,
validators, and orchestrator. Document in README, configuration, and usage;
update CHANGELOG. Extend unit and integration tests. Adjust therock_detector
(sys.path helper) without ADR docs in this commit.

Made-with: Cursor
@coketaste coketaste self-assigned this Apr 22, 2026
Copilot AI review requested due to automatic review settings April 22, 2026 23:36
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds robust ROCm root resolution for madengine run, including auto-detection for versioned /opt/rocm-* and TheRock layouts, plus explicit host/container override semantics via MAD_ROCM_PATH.

Changes:

  • Introduces ROCm root detection/normalization helpers (rocm_path_resolver) and shared TheRock file-marker helpers (therock_markers).
  • Wires host/container ROCm-path precedence through Context, run_orchestrator, CLI validation, and help text.
  • Updates docs/tests (unit + integration) to reflect the new resolution behavior and TheRock markers.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tests/unit/test_therock_markers.py Adds unit tests for TheRock marker path helpers and detection.
tests/unit/test_rocm_path.py Adds/extends unit tests for host/container ROCm path precedence and auto-detect behavior.
tests/integration/test_gpu_management.py Adjusts integration tests to patch the correct vendor-detection import path and stabilizes a Context-based test with mocks.
src/madengine/utils/therock_markers.py Adds canonical TheRock marker relative paths and helper functions.
src/madengine/utils/rocm_path_resolver.py Implements host/container ROCm root resolution, auto-detect heuristics, and legacy compatibility functions.
src/madengine/scripts/common/tools/therock_detector.py Reuses shared TheRock marker helpers and improves importability when run as a standalone script.
src/madengine/orchestration/run_orchestrator.py Defers ROCm path resolution to Context instead of pre-resolving in the orchestrator.
src/madengine/core/context.py Uses new resolver functions to compute host ROCm path and set container ROCM_PATH consistently.
src/madengine/core/constants.py Keeps get_rocm_path() as a legacy (non-auto-detect) wrapper delegating to the new resolver’s legacy function.
src/madengine/cli/validators.py Adds schema validation for host/container MAD_ROCM_PATH values in additional context.
src/madengine/cli/commands/run.py Updates --rocm-path help text to reflect new precedence/auto-detect behavior.
docs/usage.md Documents host vs container ROCm root overrides and auto-detect disabling.
docs/configuration.md Documents new precedence rules and override mechanisms for ROCm root resolution.
README.md Updates top-level docs to reflect new ROCm-path behavior and override knobs.
CHANGELOG.md Notes the new ROCm auto-detect + override behavior and references a design doc.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/madengine/cli/validators.py
Comment thread src/madengine/utils/rocm_path_resolver.py Outdated
Comment thread src/madengine/utils/rocm_path_resolver.py Outdated
Comment thread src/madengine/utils/rocm_path_resolver.py
Comment thread docs/configuration.md Outdated
Comment thread CHANGELOG.md Outdated
coketaste and others added 2 commits April 22, 2026 19:56
… checks

- csv_parser.py: resolve rocm-smi via shutil.which() so TheRock images
  (where tools live in a Python venv, not /opt/rocm/bin/) are detected
  correctly; accept path_resolver to get ROCm version without hardcoding
  /opt/rocm/.info/version; add bounds check in NVIDIA GPU info parser
- rocenv_tool.py: pass path_resolver to CSVParser so version resolution
  uses RocmPathResolver.get_version() for both TheRock and traditional installs
- container_runner.py: replace host-resolved hardcoded amd-smi/rocm-smi
  paths in container exec with PATH-based resolution; add run phase
  environment table showing host vs container installation type, ROCm/CUDA
  root, and version side-by-side (supports apt install and therock layouts)
- docs: document run phase environment table in usage.md and configuration.md

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 23, 2026 02:37
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a centralized ROCm install-root resolution flow for the run phase (host + container), including auto-detection of versioned /opt/rocm-* and TheRock marker layouts, and wires it through context creation, container execution, CLI validation, and documentation.

Changes:

  • Introduces madengine.utils.rocm_path_resolver + madengine.utils.therock_markers and updates Context to resolve host ROCm root and propagate container ROCM_PATH with clear override precedence.
  • Updates runtime execution (container runner + rocEnvTool CSV output) to better handle TheRock-style layouts where tools are discovered via PATH.
  • Adds unit tests for TheRock markers and ROCm path resolution, adjusts integration tests, and updates CLI help + docs/README/CHANGELOG to document the new behavior.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/unit/test_therock_markers.py Adds unit coverage for TheRock marker-path helpers.
tests/unit/test_rocm_path.py Adds unit coverage for new host/container ROCm path resolution behavior and TheRock layouts.
tests/integration/test_gpu_management.py Updates integration mocking/patch points to match current vendor detection paths and stabilizes Context initialization in tests.
src/madengine/utils/therock_markers.py Defines shared TheRock on-disk marker locations and detection helper.
src/madengine/utils/rocm_path_resolver.py Implements host/container ROCm root resolution, auto-detect, and legacy compatibility functions.
src/madengine/scripts/common/tools/therock_detector.py Reuses shared marker helpers; improves importability when run as a standalone script.
src/madengine/scripts/common/pre_scripts/rocEnvTool/rocenv_tool.py Passes path resolver into CSVParser so version resolution follows detected ROCm root.
src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py Uses PATH discovery for rocm-smi and uses resolver-based version when available.
src/madengine/orchestration/run_orchestrator.py Stops pre-resolving ROCm path in orchestrator; delegates to Context/resolver logic.
src/madengine/execution/container_runner.py Uses in-container PATH for SMI tools and prints a host vs container environment summary table.
src/madengine/core/context.py Centralizes ROCm path resolution and container ROCM_PATH propagation during runtime context init.
src/madengine/core/constants.py Reframes get_rocm_path() as legacy behavior and delegates to get_rocm_path_legacy().
src/madengine/cli/validators.py Validates new MAD_ROCM_PATH fields in additional context (top-level + docker_env_vars).
src/madengine/cli/commands/run.py Updates --rocm-path help to match new precedence/auto-detect behavior.
docs/usage.md Documents host/container override behavior and the new run-phase environment table output.
docs/configuration.md Documents precedence, overrides, and references ADR for ROCm path resolution.
README.md Updates high-level ROCm path messaging to match new auto-detect + MAD_ROCM_PATH override model.
CHANGELOG.md Records new ROCm path resolution behavior and references design documentation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/madengine/execution/container_runner.py Outdated
Comment thread src/madengine/execution/container_runner.py
Comment thread docs/configuration.md Outdated
Comment thread CHANGELOG.md Outdated
Comment thread src/madengine/utils/rocm_path_resolver.py Outdated
coketaste and others added 3 commits April 22, 2026 21:48
…able

The previous shell command embedded \$(rocm-sdk path --root) inside
[ -f "..." ], which broke quoting when passed via docker exec bash -c "..."
causing the test to always fall through to 'unknown'.

Replace with a quoting-safe check: if rocm-sdk command exists and returns
a root path, report 'therock'; otherwise check /opt/rocm/.info/version
for traditional apt install layout.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
- Apply docker_env_vars overrides in context; finalize in run_container (AMD)
  from OCI image env, then docker run probe, then /opt/rocm with warning
- Update docs, README, CLI reference, CHANGELOG, and unit tests
- ADR 0001 left out of this commit (untracked)

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 24, 2026 02:11
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds robust ROCm root resolution for the run phase, including host-side auto-detection (TheRock + versioned /opt/rocm-*) and explicit host/container override behavior, and updates runtime logging + docs/tests accordingly.

Changes:

  • Introduces madengine.utils.rocm_path_resolver + madengine.utils.therock_markers to centralize ROCm/TheRock detection and documented precedence.
  • Updates runtime wiring (Context + ContainerRunner) so host ROCm root is resolved early, while container ROCM_PATH is finalized at Docker run time (OCI env → probe → fallback).
  • Adds/updates unit + integration tests and updates documentation/README/CHANGELOG to reflect the new precedence and behaviors.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/madengine/utils/rocm_path_resolver.py New host/container ROCm path resolution logic (auto-detect, overrides, OCI/probe finalization).
src/madengine/utils/therock_markers.py New shared TheRock marker path helpers.
src/madengine/core/context.py Uses new host resolver; stops mirroring host ROCm root into container env at init; applies container override mapping only.
src/madengine/execution/container_runner.py Finalizes container ROCM_PATH at run time; prints host/container environment summary; uses PATH-based SMI tool invocation in container.
src/madengine/orchestration/run_orchestrator.py Passes --rocm-path through to Context without legacy resolution.
src/madengine/core/constants.py get_rocm_path() becomes legacy wrapper delegating to resolver’s legacy behavior.
src/madengine/cli/validators.py Validates MAD_ROCM_PATH in top-level context and docker_env_vars.
src/madengine/cli/commands/run.py Updates --rocm-path help text to describe new host behavior + auto-detect toggle.
src/madengine/scripts/common/tools/therock_detector.py Imports TheRock marker helpers reliably when run as a file.
src/madengine/scripts/common/pre_scripts/rocEnvTool/rocenv_tool.py Threads path_resolver into CSV parsing.
src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py Uses PATH-based rocm-smi detection; uses resolver for ROCm version; adds parsing guard.
tests/unit/test_rocm_path.py Adds coverage for new resolver APIs and revised Context/container behavior.
tests/unit/test_therock_markers.py Adds unit tests for TheRock marker helpers.
tests/integration/test_gpu_management.py Fixes patch target for vendor auto-detection; stabilizes non-AMD context creation in integration tests.
docs/usage.md Documents host vs container ROCm path behavior + new run-phase environment table.
docs/configuration.md Documents new precedence and container finalization behavior (OCI/probe).
docs/installation.md Clarifies host vs container ROCm path configuration.
docs/cli-reference.md Updates CLI reference for --rocm-path and env vars.
README.md Updates high-level ROCm path behavior summary.
CHANGELOG.md Notes behavior change and new resolver modules.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/madengine/utils/rocm_path_resolver.py Outdated
Comment thread src/madengine/utils/rocm_path_resolver.py Outdated
Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py
Comment thread docs/cli-reference.md Outdated
coketaste and others added 2 commits April 24, 2026 03:08
MI350X (gfx950) was missing from the skip_gpu_arch fixture, causing
test_commandline_argument_skip_gpu_arch to fail on this machine.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…TH in --additional-context

The --rocm-path flag was an alias for top-level MAD_ROCM_PATH but its
help text implied it could set both host and container paths, risking
confusion when environments differ. Users should set host and container
ROCm roots independently via --additional-context:

  Host:      {"MAD_ROCM_PATH": "/path/to/host/rocm"}
  Container: {"docker_env_vars": {"MAD_ROCM_PATH": "/path/in/container"}}

Also fixes datetime.utcnow() deprecation warnings in mongodb.py
(replaced with datetime.now(timezone.utc)).

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 24, 2026 19:44
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new ROCm root resolution flow that can auto-detect host ROCm installs (including versioned /opt/rocm-* and TheRock marker layouts) and cleanly separates host ROCm resolution from in-container ROCM_PATH selection at Docker run time.

Changes:

  • Added madengine.utils.rocm_path_resolver (host auto-detect + container finalization) and madengine.utils.therock_markers (shared TheRock marker paths).
  • Updated runtime wiring so Context resolves host ROCm root via the resolver, while ContainerRunner finalizes container ROCM_PATH at run time (OCI env → probe → default).
  • Removed madengine run --rocm-path usage in favor of MAD_ROCM_PATH (top-level and docker_env_vars), and updated docs/tests accordingly.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/madengine/utils/rocm_path_resolver.py Implements host ROCm auto-detect and container ROCM_PATH override/finalization logic.
src/madengine/utils/therock_markers.py Defines canonical TheRock marker relative paths + helper predicates.
src/madengine/core/context.py Switches host ROCm path resolution to the new resolver and defers container ROCM_PATH finalization.
src/madengine/execution/container_runner.py Finalizes in-container ROCM_PATH during Docker runs and adds a host/container environment summary table.
src/madengine/core/constants.py Reframes get_rocm_path() as legacy resolution and delegates to resolver legacy helper.
src/madengine/cli/commands/run.py Removes --rocm-path CLI option wiring.
src/madengine/cli/validators.py Validates MAD_ROCM_PATH (host and docker_env_vars) shapes in additional context.
src/madengine/scripts/common/tools/therock_detector.py Makes the tool importable when executed as a file; reuses shared marker helpers.
src/madengine/scripts/common/pre_scripts/rocEnvTool/{rocenv_tool.py,csv_parser.py} Passes resolver into CSV parsing and improves tool/path selection for TheRock-style layouts.
tests/unit/test_rocm_path.py Adds/updates unit coverage for resolver behaviors and container override rules.
tests/unit/test_therock_markers.py Adds unit coverage for TheRock marker helpers.
tests/integration/test_gpu_management.py Updates patch points and stabilizes integration behavior with Context initialization changes.
docs/{usage.md,installation.md,configuration.md,cli-reference.md} Updates documented precedence and configuration guidance for host vs container ROCm roots.
README.md Updates top-level ROCm path guidance and examples.
CHANGELOG.md Adds an Unreleased entry describing new ROCm resolution behavior.
src/madengine/database/mongodb.py Switches metadata timestamps to timezone-aware UTC.
tests/fixtures/dummy/models.json Adds gfx950 to skip list for a dummy model.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread README.md Outdated
Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/utils/rocm_path_resolver.py
Comment thread src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py Outdated
Comment thread CHANGELOG.md Outdated
Comment thread src/madengine/utils/rocm_path_resolver.py Outdated
coketaste and others added 2 commits April 24, 2026 16:22
…alongside is_therock_tree)

Conflict in container_runner.py: HEAD added is_therock_tree import for TheRock
environment table; develop added PERFORMANCE_LOG_PATTERN import from deployment.base.
Both imports are used in the file — keep both.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
- validators.py: restrict docker_env_vars.MAD_ROCM_PATH to str|None only
  (was allowing int/float, inconsistent with error message)

- rocm_path_resolver.py: always pop MAD_ROCM_PATH from docker_env even
  when None, preventing the key from leaking into docker run env

- rocm_path_resolver.py: add bin/rocm-smi to _CONTAINER_PROBE_SH
  try_root() predicate to match Python-side looks_like_rocm_root()

- container_runner.py: clear ROCM_PATH before finalize_container_rocm_path
  each run so multi-model runs re-resolve per docker_image instead of
  reusing image 1's path for all subsequent images

- csv_parser.py: shlex.quote() the rocm-smi path before shell interpolation;
  only increment num_gpu after successful NVIDIA line parse (git-ignored
  path, committed separately if needed)

- README.md: replace curly/smart quotes with ASCII quotes in bash examples

- CHANGELOG.md: remove stale --rocm-path alias reference (flag was removed);
  remove broken docs/adr/ link

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 24, 2026 21:35
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new ROCm root resolution system to better support TheRock/versioned ROCm layouts and clearly separate host vs container ROCm path overrides, wiring it through context initialization and Docker run-time environment setup.

Changes:

  • Introduces rocm_path_resolver (host auto-detect + container finalization) and therock_markers (shared marker paths).
  • Updates runtime flow to resolve host ROCm root in Context and set container ROCM_PATH at Docker run-time (OCI env → probe → default), plus prints a host/container environment summary table.
  • Removes --rocm-path and updates validators/tests/docs to use MAD_ROCM_PATH (host) and docker_env_vars.MAD_ROCM_PATH (container).

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
src/madengine/utils/rocm_path_resolver.py Implements host ROCm auto-detect and container ROCM_PATH finalization logic.
src/madengine/utils/therock_markers.py Adds shared TheRock marker path helpers and detection.
src/madengine/core/context.py Uses new host ROCm resolver; stops mirroring host ROCM_PATH into container env at init.
src/madengine/execution/container_runner.py Finalizes container ROCM_PATH at run-time and prints host/container env summary.
src/madengine/core/constants.py Reframes get_rocm_path() as legacy (no auto-detect) and delegates to resolver legacy helper.
src/madengine/cli/commands/run.py Removes --rocm-path option and associated wiring.
src/madengine/cli/validators.py Validates MAD_ROCM_PATH types in additional context (host + docker_env_vars).
src/madengine/orchestration/run_orchestrator.py Stops passing legacy rocm_path into Context.
src/madengine/scripts/common/tools/therock_detector.py Reuses shared marker helpers; makes imports work when run as a file.
src/madengine/scripts/common/pre_scripts/rocEnvTool/rocenv_tool.py Passes resolver into CSV generation for version lookup.
src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py Improves tool discovery (PATH-based) and ROCm version collection fallback.
src/madengine/database/mongodb.py Stores upload timestamps using timezone-aware UTC datetimes.
tests/unit/test_rocm_path.py Adds/updates unit tests for resolver behavior and Context integration.
tests/unit/test_therock_markers.py Adds unit tests for TheRock marker helpers.
tests/integration/test_gpu_management.py Updates mocks/patch targets for vendor auto-detection and Context init behavior.
tests/fixtures/dummy/models.json Updates dummy fixture skip list to include gfx950.
docs/configuration.md Documents new ROCm path precedence and overrides.
docs/usage.md Documents host/container ROCm path behavior and new env summary table.
docs/installation.md Updates install guidance to MAD_ROCM_PATH / container override behavior.
docs/cli-reference.md Removes --rocm-path references; documents new env/override keys.
README.md Updates top-level ROCm path feature description and examples.
CHANGELOG.md Notes new default host auto-detect + container ROCM_PATH resolution behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/execution/container_runner.py Outdated
Comment thread src/madengine/utils/rocm_path_resolver.py
Comment thread src/madengine/utils/rocm_path_resolver.py Outdated
Comment thread docs/configuration.md Outdated
Comment thread docs/usage.md Outdated
Comment thread src/madengine/utils/therock_markers.py Outdated
…n logic

- Remove broken docs/adr/ links from configuration.md, usage.md,
  rocm_path_resolver.py docstring, and therock_markers.py comment
- Remove resolve_container_rocm_path() (thin wrapper with no unique logic)
- Remove early apply_container_rocm_path_overrides() call in init_gpu_context:
  it set ROCM_PATH only for run_container to immediately pop it; container
  ROCM_PATH finalization now happens exclusively in finalize_container_rocm_path
- Fix auto_detect() inner loop: first iteration was a duplicate
  _looks_like_rocm_root(root) check already done before the loop; walk
  to root.parent first, then check
- Silence grep -oP stderr in container ROCm version command (BusyBox images)
- Guard whitespace-only ROCM_PATH in apply_container_rocm_path_overrides
- Update tests to match: MAD_ROCM_PATH preserved in docker_env_vars at
  context init, consumed at run time by finalize_container_rocm_path

All 383 unit tests pass.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants