feat: ROCm path resolution (auto-detect, MAD_ROCM_PATH, TheRock markers)#110
feat: ROCm path resolution (auto-detect, MAD_ROCM_PATH, TheRock markers)#110
Conversation
Add rocm_path_resolver and shared therock_markers; wire Context, run CLI, validators, and orchestrator. Document in README, configuration, and usage; update CHANGELOG. Extend unit and integration tests. Adjust therock_detector (sys.path helper) without ADR docs in this commit. Made-with: Cursor
There was a problem hiding this comment.
Pull request overview
Adds robust ROCm root resolution for madengine run, including auto-detection for versioned /opt/rocm-* and TheRock layouts, plus explicit host/container override semantics via MAD_ROCM_PATH.
Changes:
- Introduces ROCm root detection/normalization helpers (
rocm_path_resolver) and shared TheRock file-marker helpers (therock_markers). - Wires host/container ROCm-path precedence through
Context,run_orchestrator, CLI validation, and help text. - Updates docs/tests (unit + integration) to reflect the new resolution behavior and TheRock markers.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/test_therock_markers.py | Adds unit tests for TheRock marker path helpers and detection. |
| tests/unit/test_rocm_path.py | Adds/extends unit tests for host/container ROCm path precedence and auto-detect behavior. |
| tests/integration/test_gpu_management.py | Adjusts integration tests to patch the correct vendor-detection import path and stabilizes a Context-based test with mocks. |
| src/madengine/utils/therock_markers.py | Adds canonical TheRock marker relative paths and helper functions. |
| src/madengine/utils/rocm_path_resolver.py | Implements host/container ROCm root resolution, auto-detect heuristics, and legacy compatibility functions. |
| src/madengine/scripts/common/tools/therock_detector.py | Reuses shared TheRock marker helpers and improves importability when run as a standalone script. |
| src/madengine/orchestration/run_orchestrator.py | Defers ROCm path resolution to Context instead of pre-resolving in the orchestrator. |
| src/madengine/core/context.py | Uses new resolver functions to compute host ROCm path and set container ROCM_PATH consistently. |
| src/madengine/core/constants.py | Keeps get_rocm_path() as a legacy (non-auto-detect) wrapper delegating to the new resolver’s legacy function. |
| src/madengine/cli/validators.py | Adds schema validation for host/container MAD_ROCM_PATH values in additional context. |
| src/madengine/cli/commands/run.py | Updates --rocm-path help text to reflect new precedence/auto-detect behavior. |
| docs/usage.md | Documents host vs container ROCm root overrides and auto-detect disabling. |
| docs/configuration.md | Documents new precedence rules and override mechanisms for ROCm root resolution. |
| README.md | Updates top-level docs to reflect new ROCm-path behavior and override knobs. |
| CHANGELOG.md | Notes the new ROCm auto-detect + override behavior and references a design doc. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
… checks - csv_parser.py: resolve rocm-smi via shutil.which() so TheRock images (where tools live in a Python venv, not /opt/rocm/bin/) are detected correctly; accept path_resolver to get ROCm version without hardcoding /opt/rocm/.info/version; add bounds check in NVIDIA GPU info parser - rocenv_tool.py: pass path_resolver to CSVParser so version resolution uses RocmPathResolver.get_version() for both TheRock and traditional installs - container_runner.py: replace host-resolved hardcoded amd-smi/rocm-smi paths in container exec with PATH-based resolution; add run phase environment table showing host vs container installation type, ROCm/CUDA root, and version side-by-side (supports apt install and therock layouts) - docs: document run phase environment table in usage.md and configuration.md Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds a centralized ROCm install-root resolution flow for the run phase (host + container), including auto-detection of versioned /opt/rocm-* and TheRock marker layouts, and wires it through context creation, container execution, CLI validation, and documentation.
Changes:
- Introduces
madengine.utils.rocm_path_resolver+madengine.utils.therock_markersand updatesContextto resolve host ROCm root and propagate containerROCM_PATHwith clear override precedence. - Updates runtime execution (container runner + rocEnvTool CSV output) to better handle TheRock-style layouts where tools are discovered via
PATH. - Adds unit tests for TheRock markers and ROCm path resolution, adjusts integration tests, and updates CLI help + docs/README/CHANGELOG to document the new behavior.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/test_therock_markers.py | Adds unit coverage for TheRock marker-path helpers. |
| tests/unit/test_rocm_path.py | Adds unit coverage for new host/container ROCm path resolution behavior and TheRock layouts. |
| tests/integration/test_gpu_management.py | Updates integration mocking/patch points to match current vendor detection paths and stabilizes Context initialization in tests. |
| src/madengine/utils/therock_markers.py | Defines shared TheRock on-disk marker locations and detection helper. |
| src/madengine/utils/rocm_path_resolver.py | Implements host/container ROCm root resolution, auto-detect, and legacy compatibility functions. |
| src/madengine/scripts/common/tools/therock_detector.py | Reuses shared marker helpers; improves importability when run as a standalone script. |
| src/madengine/scripts/common/pre_scripts/rocEnvTool/rocenv_tool.py | Passes path resolver into CSVParser so version resolution follows detected ROCm root. |
| src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py | Uses PATH discovery for rocm-smi and uses resolver-based version when available. |
| src/madengine/orchestration/run_orchestrator.py | Stops pre-resolving ROCm path in orchestrator; delegates to Context/resolver logic. |
| src/madengine/execution/container_runner.py | Uses in-container PATH for SMI tools and prints a host vs container environment summary table. |
| src/madengine/core/context.py | Centralizes ROCm path resolution and container ROCM_PATH propagation during runtime context init. |
| src/madengine/core/constants.py | Reframes get_rocm_path() as legacy behavior and delegates to get_rocm_path_legacy(). |
| src/madengine/cli/validators.py | Validates new MAD_ROCM_PATH fields in additional context (top-level + docker_env_vars). |
| src/madengine/cli/commands/run.py | Updates --rocm-path help to match new precedence/auto-detect behavior. |
| docs/usage.md | Documents host/container override behavior and the new run-phase environment table output. |
| docs/configuration.md | Documents precedence, overrides, and references ADR for ROCm path resolution. |
| README.md | Updates high-level ROCm path messaging to match new auto-detect + MAD_ROCM_PATH override model. |
| CHANGELOG.md | Records new ROCm path resolution behavior and references design documentation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…able The previous shell command embedded \$(rocm-sdk path --root) inside [ -f "..." ], which broke quoting when passed via docker exec bash -c "..." causing the test to always fall through to 'unknown'. Replace with a quoting-safe check: if rocm-sdk command exists and returns a root path, report 'therock'; otherwise check /opt/rocm/.info/version for traditional apt install layout. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
- Apply docker_env_vars overrides in context; finalize in run_container (AMD) from OCI image env, then docker run probe, then /opt/rocm with warning - Update docs, README, CLI reference, CHANGELOG, and unit tests - ADR 0001 left out of this commit (untracked) Made-with: Cursor
…to coketaste/v2-rocm-path
There was a problem hiding this comment.
Pull request overview
Adds robust ROCm root resolution for the run phase, including host-side auto-detection (TheRock + versioned /opt/rocm-*) and explicit host/container override behavior, and updates runtime logging + docs/tests accordingly.
Changes:
- Introduces
madengine.utils.rocm_path_resolver+madengine.utils.therock_markersto centralize ROCm/TheRock detection and documented precedence. - Updates runtime wiring (Context + ContainerRunner) so host ROCm root is resolved early, while container
ROCM_PATHis finalized at Docker run time (OCI env → probe → fallback). - Adds/updates unit + integration tests and updates documentation/README/CHANGELOG to reflect the new precedence and behaviors.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
src/madengine/utils/rocm_path_resolver.py |
New host/container ROCm path resolution logic (auto-detect, overrides, OCI/probe finalization). |
src/madengine/utils/therock_markers.py |
New shared TheRock marker path helpers. |
src/madengine/core/context.py |
Uses new host resolver; stops mirroring host ROCm root into container env at init; applies container override mapping only. |
src/madengine/execution/container_runner.py |
Finalizes container ROCM_PATH at run time; prints host/container environment summary; uses PATH-based SMI tool invocation in container. |
src/madengine/orchestration/run_orchestrator.py |
Passes --rocm-path through to Context without legacy resolution. |
src/madengine/core/constants.py |
get_rocm_path() becomes legacy wrapper delegating to resolver’s legacy behavior. |
src/madengine/cli/validators.py |
Validates MAD_ROCM_PATH in top-level context and docker_env_vars. |
src/madengine/cli/commands/run.py |
Updates --rocm-path help text to describe new host behavior + auto-detect toggle. |
src/madengine/scripts/common/tools/therock_detector.py |
Imports TheRock marker helpers reliably when run as a file. |
src/madengine/scripts/common/pre_scripts/rocEnvTool/rocenv_tool.py |
Threads path_resolver into CSV parsing. |
src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py |
Uses PATH-based rocm-smi detection; uses resolver for ROCm version; adds parsing guard. |
tests/unit/test_rocm_path.py |
Adds coverage for new resolver APIs and revised Context/container behavior. |
tests/unit/test_therock_markers.py |
Adds unit tests for TheRock marker helpers. |
tests/integration/test_gpu_management.py |
Fixes patch target for vendor auto-detection; stabilizes non-AMD context creation in integration tests. |
docs/usage.md |
Documents host vs container ROCm path behavior + new run-phase environment table. |
docs/configuration.md |
Documents new precedence and container finalization behavior (OCI/probe). |
docs/installation.md |
Clarifies host vs container ROCm path configuration. |
docs/cli-reference.md |
Updates CLI reference for --rocm-path and env vars. |
README.md |
Updates high-level ROCm path behavior summary. |
CHANGELOG.md |
Notes behavior change and new resolver modules. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
MI350X (gfx950) was missing from the skip_gpu_arch fixture, causing test_commandline_argument_skip_gpu_arch to fail on this machine. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…TH in --additional-context
The --rocm-path flag was an alias for top-level MAD_ROCM_PATH but its
help text implied it could set both host and container paths, risking
confusion when environments differ. Users should set host and container
ROCm roots independently via --additional-context:
Host: {"MAD_ROCM_PATH": "/path/to/host/rocm"}
Container: {"docker_env_vars": {"MAD_ROCM_PATH": "/path/in/container"}}
Also fixes datetime.utcnow() deprecation warnings in mongodb.py
(replaced with datetime.now(timezone.utc)).
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR introduces a new ROCm root resolution flow that can auto-detect host ROCm installs (including versioned /opt/rocm-* and TheRock marker layouts) and cleanly separates host ROCm resolution from in-container ROCM_PATH selection at Docker run time.
Changes:
- Added
madengine.utils.rocm_path_resolver(host auto-detect + container finalization) andmadengine.utils.therock_markers(shared TheRock marker paths). - Updated runtime wiring so
Contextresolves host ROCm root via the resolver, whileContainerRunnerfinalizes containerROCM_PATHat run time (OCI env → probe → default). - Removed
madengine run --rocm-pathusage in favor ofMAD_ROCM_PATH(top-level anddocker_env_vars), and updated docs/tests accordingly.
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
src/madengine/utils/rocm_path_resolver.py |
Implements host ROCm auto-detect and container ROCM_PATH override/finalization logic. |
src/madengine/utils/therock_markers.py |
Defines canonical TheRock marker relative paths + helper predicates. |
src/madengine/core/context.py |
Switches host ROCm path resolution to the new resolver and defers container ROCM_PATH finalization. |
src/madengine/execution/container_runner.py |
Finalizes in-container ROCM_PATH during Docker runs and adds a host/container environment summary table. |
src/madengine/core/constants.py |
Reframes get_rocm_path() as legacy resolution and delegates to resolver legacy helper. |
src/madengine/cli/commands/run.py |
Removes --rocm-path CLI option wiring. |
src/madengine/cli/validators.py |
Validates MAD_ROCM_PATH (host and docker_env_vars) shapes in additional context. |
src/madengine/scripts/common/tools/therock_detector.py |
Makes the tool importable when executed as a file; reuses shared marker helpers. |
src/madengine/scripts/common/pre_scripts/rocEnvTool/{rocenv_tool.py,csv_parser.py} |
Passes resolver into CSV parsing and improves tool/path selection for TheRock-style layouts. |
tests/unit/test_rocm_path.py |
Adds/updates unit coverage for resolver behaviors and container override rules. |
tests/unit/test_therock_markers.py |
Adds unit coverage for TheRock marker helpers. |
tests/integration/test_gpu_management.py |
Updates patch points and stabilizes integration behavior with Context initialization changes. |
docs/{usage.md,installation.md,configuration.md,cli-reference.md} |
Updates documented precedence and configuration guidance for host vs container ROCm roots. |
README.md |
Updates top-level ROCm path guidance and examples. |
CHANGELOG.md |
Adds an Unreleased entry describing new ROCm resolution behavior. |
src/madengine/database/mongodb.py |
Switches metadata timestamps to timezone-aware UTC. |
tests/fixtures/dummy/models.json |
Adds gfx950 to skip list for a dummy model. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…alongside is_therock_tree) Conflict in container_runner.py: HEAD added is_therock_tree import for TheRock environment table; develop added PERFORMANCE_LOG_PATTERN import from deployment.base. Both imports are used in the file — keep both. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
- validators.py: restrict docker_env_vars.MAD_ROCM_PATH to str|None only (was allowing int/float, inconsistent with error message) - rocm_path_resolver.py: always pop MAD_ROCM_PATH from docker_env even when None, preventing the key from leaking into docker run env - rocm_path_resolver.py: add bin/rocm-smi to _CONTAINER_PROBE_SH try_root() predicate to match Python-side looks_like_rocm_root() - container_runner.py: clear ROCM_PATH before finalize_container_rocm_path each run so multi-model runs re-resolve per docker_image instead of reusing image 1's path for all subsequent images - csv_parser.py: shlex.quote() the rocm-smi path before shell interpolation; only increment num_gpu after successful NVIDIA line parse (git-ignored path, committed separately if needed) - README.md: replace curly/smart quotes with ASCII quotes in bash examples - CHANGELOG.md: remove stale --rocm-path alias reference (flag was removed); remove broken docs/adr/ link Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds a new ROCm root resolution system to better support TheRock/versioned ROCm layouts and clearly separate host vs container ROCm path overrides, wiring it through context initialization and Docker run-time environment setup.
Changes:
- Introduces
rocm_path_resolver(host auto-detect + container finalization) andtherock_markers(shared marker paths). - Updates runtime flow to resolve host ROCm root in
Contextand set containerROCM_PATHat Docker run-time (OCI env → probe → default), plus prints a host/container environment summary table. - Removes
--rocm-pathand updates validators/tests/docs to useMAD_ROCM_PATH(host) anddocker_env_vars.MAD_ROCM_PATH(container).
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
src/madengine/utils/rocm_path_resolver.py |
Implements host ROCm auto-detect and container ROCM_PATH finalization logic. |
src/madengine/utils/therock_markers.py |
Adds shared TheRock marker path helpers and detection. |
src/madengine/core/context.py |
Uses new host ROCm resolver; stops mirroring host ROCM_PATH into container env at init. |
src/madengine/execution/container_runner.py |
Finalizes container ROCM_PATH at run-time and prints host/container env summary. |
src/madengine/core/constants.py |
Reframes get_rocm_path() as legacy (no auto-detect) and delegates to resolver legacy helper. |
src/madengine/cli/commands/run.py |
Removes --rocm-path option and associated wiring. |
src/madengine/cli/validators.py |
Validates MAD_ROCM_PATH types in additional context (host + docker_env_vars). |
src/madengine/orchestration/run_orchestrator.py |
Stops passing legacy rocm_path into Context. |
src/madengine/scripts/common/tools/therock_detector.py |
Reuses shared marker helpers; makes imports work when run as a file. |
src/madengine/scripts/common/pre_scripts/rocEnvTool/rocenv_tool.py |
Passes resolver into CSV generation for version lookup. |
src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py |
Improves tool discovery (PATH-based) and ROCm version collection fallback. |
src/madengine/database/mongodb.py |
Stores upload timestamps using timezone-aware UTC datetimes. |
tests/unit/test_rocm_path.py |
Adds/updates unit tests for resolver behavior and Context integration. |
tests/unit/test_therock_markers.py |
Adds unit tests for TheRock marker helpers. |
tests/integration/test_gpu_management.py |
Updates mocks/patch targets for vendor auto-detection and Context init behavior. |
tests/fixtures/dummy/models.json |
Updates dummy fixture skip list to include gfx950. |
docs/configuration.md |
Documents new ROCm path precedence and overrides. |
docs/usage.md |
Documents host/container ROCm path behavior and new env summary table. |
docs/installation.md |
Updates install guidance to MAD_ROCM_PATH / container override behavior. |
docs/cli-reference.md |
Removes --rocm-path references; documents new env/override keys. |
README.md |
Updates top-level ROCm path feature description and examples. |
CHANGELOG.md |
Notes new default host auto-detect + container ROCM_PATH resolution behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…n logic - Remove broken docs/adr/ links from configuration.md, usage.md, rocm_path_resolver.py docstring, and therock_markers.py comment - Remove resolve_container_rocm_path() (thin wrapper with no unique logic) - Remove early apply_container_rocm_path_overrides() call in init_gpu_context: it set ROCM_PATH only for run_container to immediately pop it; container ROCM_PATH finalization now happens exclusively in finalize_container_rocm_path - Fix auto_detect() inner loop: first iteration was a duplicate _looks_like_rocm_root(root) check already done before the loop; walk to root.parent first, then check - Silence grep -oP stderr in container ROCm version command (BusyBox images) - Guard whitespace-only ROCM_PATH in apply_container_rocm_path_overrides - Update tests to match: MAD_ROCM_PATH preserved in docker_env_vars at context init, consumed at run time by finalize_container_rocm_path All 383 unit tests pass. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Summary
Adds default ROCm install root resolution for the run phase: scan common layouts (including versioned
/opt/rocm-*and TheRock markers), with clear overrides for host vs container. Aligns in-containerROCM_PATHwithdocker_env_varswhen the image layout differs from the host.Motivation
Users on TheRock- or CI-style images may not have a plain
/opt/rocmtree;madengineshould still find a valid ROCm root for validation and tool paths, while allowing explicitMAD_ROCM_PATH/--rocm-pathwhen auto-detect is wrong or undesired.What changed
madengine.utils.rocm_path_resolverandmadengine.utils.therock_markers(shared TheRockshare/therockfile markers).MAD_ROCM_PATHin additional context →--rocm-path→ auto-detect (unlessMAD_AUTO_ROCM_PATH=0) →ROCM_PATH→/opt/rocm.docker_env_vars.MAD_ROCM_PATHsets in-containerROCM_PATH(key consumed); otherwise host-resolved path is mirrored unless the user setROCM_PATHindocker_env_vars.therock_detectorscript: prepends the package parent tosys.pathso imports work when run as a file; last-resort path fallback only if the script is outside the tree.test_rocm_path,test_therock_markers) and integration GPU test adjustments.How to test
pytest tests/unit/test_rocm_path.py tests/unit/test_therock_markers.py(or fulltests/unit/).madengine run --helpand confirm--rocm-path/ env behavior as documented.madengine runwithoutMAD_ROCM_PATHand confirm a sensibleROCM_PATHin the generated context.Checklist (optional; trim for your team)
MAD_AUTO_ROCM_PATH=0beyond documented legacy path.