1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Changed

- **ROCm path**: Auto-detect host ROCm root by default (traditional `/opt/rocm`, versioned `/opt/rocm-*`, TheRock `rocm-sdk` + markers, `rocminfo` / `amd-smi` / `rocm-smi` on `PATH`). **Host** overrides: top-level `MAD_ROCM_PATH` in `--additional-context`. **Container** `ROCM_PATH` for Docker runs (AMD): `docker_env_vars.MAD_ROCM_PATH` if set; else `ROCM_PATH` / `ROCM_HOME` from the image OCI config (`docker image inspect`); else an in-image shell probe (`docker run --rm`); else default `/opt/rocm` with a warning. The host-resolved path is no longer mirrored into the container by default. Set `MAD_AUTO_ROCM_PATH=0` to use legacy host `ROCM_PATH` / `/opt/rocm` only without scanning. Code: `madengine.utils.rocm_path_resolver`, shared TheRock file markers in `madengine.utils.therock_markers`.
- **Profiling**: `rocm_trace_lite` now sets `RTL_MODE=lite` explicitly; added tool `rocm_trace_lite_default` with `RTL_MODE=default` for A/B overhead comparison. `rtl_trace_wrapper.sh` passes `rtl trace --mode …` when `RTL_MODE` is set.

## [2.0.1] - 2026-04-23
11 changes: 7 additions & 4 deletions README.md
@@ -42,7 +42,7 @@ madengine is a modern CLI tool for running Large Language Models (LLMs) and Deep
- **🎯 Simple Deployment** - Run locally or deploy to Kubernetes/SLURM via configuration
- **🔧 Distributed Launchers** - Full support for torchrun, DeepSpeed, Megatron-LM, TorchTitan, Primus, vLLM, SGLang
- **🐳 Container-Native** - Docker-based execution with GPU support (ROCm, CUDA)
- **📂 ROCm Path** - Support for non-default ROCm installs via `--rocm-path` or `ROCM_PATH` (e.g. Rock, pip)
- **📂 ROCm Path** - Auto-detect **host** ROCm root (override with top-level `MAD_ROCM_PATH`); in-container `ROCM_PATH` is set independently via `docker_env_vars.MAD_ROCM_PATH` and resolved at Docker run (image OCI env + in-image probe, not host mirroring) — see [Configuration](docs/configuration.md#rocm-path-run-only)
- **📊 Performance Tools** - Integrated profiling with rocprof/rocprofv3, [rocm-trace-lite](https://github.com/sunway513/rocm-trace-lite) (RTL), rocblas, MIOpen, RCCL tracing
- **🎯 ROCprofv3 Profiles** - 8 pre-configured profiles for compute/memory/communication bottleneck analysis
- **🔍 Environment Validation** - TheRock ROCm detection and validation tools
@@ -71,11 +71,14 @@ madengine run --tags dummy \

> **Note**: For build operations, `gpu_vendor` defaults to `AMD` and `guest_os` defaults to `UBUNTU` if not specified. For production deployments or non-AMD/Ubuntu environments, explicitly specify these values.

If ROCm is not installed under `/opt/rocm` (e.g. Rock or pip install), use `--rocm-path` or set `ROCM_PATH`:
If auto-detection does not find your **host** ROCm root, set top-level `MAD_ROCM_PATH` in `--additional-context`. For a different ROCm root **inside the container**, set `docker_env_vars.MAD_ROCM_PATH` in additional context. If you omit it, madengine derives in-container `ROCM_PATH` when running Docker (from the image's baked-in env, then an in-container probe, then `/opt/rocm`; it does **not** copy the host path). To control **host** resolution through the environment instead, set `ROCM_PATH`, or set `MAD_AUTO_ROCM_PATH=0` to disable auto-detection, as documented in [docs/configuration.md](docs/configuration.md):

```bash
madengine run --tags dummy --rocm-path /path/to/rocm
# Override host ROCm root:
madengine run --tags dummy --additional-context '{"MAD_ROCM_PATH": "/path/to/rocm"}'
# or: export ROCM_PATH=/path/to/rocm && madengine run --tags dummy
# Override in-container ROCm root independently:
madengine run --tags dummy --additional-context '{"docker_env_vars": {"MAD_ROCM_PATH": "/path/in/container"}}'
```
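
The JSON passed to `--additional-context` is easy to mis-quote by hand, especially with the nested `docker_env_vars` object. As a minimal sketch (the helper name `build_run_command` is hypothetical and not part of madengine; only the `--tags` and `--additional-context` flags and the two `MAD_ROCM_PATH` keys come from the text above), the payload could be assembled programmatically:

```python
import json
import shlex


def build_run_command(tags, host_rocm=None, container_rocm=None):
    """Hypothetical helper: build a `madengine run` argv with a JSON context."""
    context = {}
    if host_rocm:
        # Top-level key: host ROCm root override.
        context["MAD_ROCM_PATH"] = host_rocm
    if container_rocm:
        # Nested key: in-container ROCm root override, independent of the host.
        context["docker_env_vars"] = {"MAD_ROCM_PATH": container_rocm}
    argv = ["madengine", "run", "--tags", tags]
    if context:
        argv += ["--additional-context", json.dumps(context)]
    return argv


argv = build_run_command("dummy", "/path/to/rocm", "/path/in/container")
# shlex.join quotes the JSON argument so it can be pasted into a POSIX shell.
print(shlex.join(argv))
```

Passing the list directly to `subprocess.run` avoids shell quoting altogether; `shlex.join` is only needed when the command is printed or copied into a script.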

**Results:** Performance data is written to `perf.csv` (and optionally `perf_entry.csv`). The file is created automatically if missing. Failed runs (including pre-run setup failures) are recorded with status `FAILURE` so every attempted model appears in the table. See [Exit Codes](docs/cli-reference.md#exit-codes) for CI/script usage.
@@ -619,7 +622,7 @@ madengine run --tags model --keep-alive
madengine build --tags model --clean-docker-cache --verbose
```

**ROCm not in /opt/rocm:** If you use a custom ROCm location (e.g. [TheRock](https://github.com/ROCm/TheRock) or pip), set `ROCM_PATH` or pass `--rocm-path` to `madengine run` so GPU detection and container env use the correct paths.
**ROCm not in /opt/rocm:** Set top-level `MAD_ROCM_PATH` in `--additional-context` for the **host**; for **in-container** paths, set `docker_env_vars.MAD_ROCM_PATH`, or let madengine resolve `ROCM_PATH` at run time from the image's environment and an in-image probe (see [Configuration](docs/configuration.md#rocm-path-run-only)).

**Common Issues:**
- **False failures with profiling**: If models show FAILURE but have performance metrics, see [Profiling Troubleshooting](docs/profiling.md#false-failure-detection-with-rocprof)
14 changes: 9 additions & 5 deletions docs/cli-reference.md
@@ -211,7 +211,6 @@ madengine run [OPTIONS]
|--------|-------|------|---------|-------------|
| `--tags` | `-t` | TEXT | `[]` | Model tags to run (can specify multiple) |
| `--manifest-file` | `-m` | TEXT | `""` | Build manifest file path (for pre-built images) |
| `--rocm-path` | | TEXT | `None` | ROCm installation root (default: `ROCM_PATH` env or `/opt/rocm`). Use when ROCm is not in `/opt/rocm` (e.g. Rock, pip). |
| `--registry` | `-r` | TEXT | `None` | Docker registry URL |
| `--timeout` | | INT | `-1` | Timeout in seconds (-1=default 7200s, 0=no timeout) |
| `--additional-context` | `-c` | TEXT | `"{}"` | Additional context as JSON string |
@@ -240,9 +239,13 @@ madengine run [OPTIONS]
madengine run --tags dummy \
--additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# Custom ROCm path (when ROCm is not in /opt/rocm, e.g. Rock or pip install)
madengine run --tags dummy --rocm-path /path/to/rocm \
--additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
# Custom host ROCm path (when ROCm is not in /opt/rocm, e.g. TheRock or pip install)
madengine run --tags dummy \
--additional-context '{"MAD_ROCM_PATH": "/path/to/rocm", "gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# Custom in-container ROCm path (independent from host)
madengine run --tags dummy \
--additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU", "docker_env_vars": {"MAD_ROCM_PATH": "/path/in/container"}}'

# Run with pre-built images (manifest-based)
madengine run --manifest-file build_manifest.json
@@ -617,7 +620,8 @@ madengine recognizes these environment variables:
| Variable | Description | Default |
|----------|-------------|---------|
| `MODEL_DIR` | Path to MAD package directory | Auto-detected |
| `ROCM_PATH` | ROCm installation root (used when `--rocm-path` not set) | `/opt/rocm` |
| `ROCM_PATH` | **Host** ROCm installation root (fallback when `MAD_ROCM_PATH` is not set in additional context and auto-detect is disabled or finds nothing). In-container `ROCM_PATH` for Docker is not taken from this variable; set `docker_env_vars.MAD_ROCM_PATH` in additional context instead. | `/opt/rocm` |
| `MAD_AUTO_ROCM_PATH` | Set to `0` to disable **host** auto-detect (`ROCM_PATH` then `/opt/rocm` on the host). | (default: scan on) |
| `MAD_VERBOSE_CONFIG` | Enable verbose configuration logging | `false` |
| `MAD_DOCKERHUB_USER` | Docker Hub username | None |
| `MAD_DOCKERHUB_PASSWORD` | Docker Hub password/token | None |
18 changes: 14 additions & 4 deletions docs/configuration.md
@@ -131,12 +131,22 @@ Disabling the scan does **not** change performance metric extraction from the lo

### ROCm path (run only)

When ROCm is not installed under `/opt/rocm` (e.g. [TheRock](https://github.com/ROCm/TheRock) or pip), set the ROCm root so GPU detection and container environment use the correct paths. Use the **run** command option or environment variable (not JSON context):
**Host** (where `madengine` runs validation): by default, the ROCm root is **auto-detected** (traditional `/opt/rocm`, [TheRock](https://github.com/ROCm/TheRock) `rocm-sdk` / manifest layout, or `ROCM_PATH`-like env hints). Set `MAD_AUTO_ROCM_PATH=0` to skip auto-detection and use only the legacy resolution (`ROCM_PATH`, then `/opt/rocm`).

- **CLI:** `madengine run --rocm-path /path/to/rocm ...`
- **Environment:** `export ROCM_PATH=/path/to/rocm`
**Overrides** (recommended for CI):

Resolution order: `--rocm-path` → `ROCM_PATH` → `/opt/rocm`. This applies only to the run phase; build does not perform GPU detection.
- **Additional context (host):** top-level `"MAD_ROCM_PATH": "/path/to/host/rocm"` — controls where madengine looks for host GPU tools (`rocminfo`, `amd-smi`, etc.).
- **Additional context (container):** `"docker_env_vars": { "MAD_ROCM_PATH": "/path/inside/image" }` — sets the in-container `ROCM_PATH` for Docker runs. If omitted, at `run` time madengine uses the image OCI `Env` (`ROCM_PATH` / `ROCM_HOME`) if present, then an in-container probe, then defaults to `/opt/rocm`. The host-resolved path is **not** mirrored into the container.

These two keys are independent, allowing host and container to use different ROCm installations without confusion.

Precedence (host): top-level `MAD_ROCM_PATH` → auto-detect (unless disabled) → `ROCM_PATH` → `/opt/rocm`.

Precedence (container, **local Docker `run`**, **AMD**): `docker_env_vars.MAD_ROCM_PATH` (maps to `ROCM_PATH` for the workload) or explicit `ROCM_PATH` in `docker_env_vars` → image OCI `Env` (`ROCM_PATH` / `ROCM_HOME`) → in-image probe → default `/opt/rocm` with a warning. Implemented in `ContainerRunner.run_container` after the run image is resolved.

This applies to the run phase; build uses build-only context (no GPU detection) but still honors `MAD_ROCM_PATH` in context when set.

At the start of each container run, a **Run Phase Environment** table is printed showing host vs container installation type (`apt install` or `therock`), ROCm/CUDA root, and version side-by-side. See [Run phase environment table](usage.md#run-phase-environment-table).
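
The host precedence listed above can be sketched as a small pure function. This is an illustration under stated assumptions, not the code in `madengine.utils.rocm_path_resolver`: the function name, its signature, and the injected `auto_detect` callable are hypothetical, and the real scan logic is not reproduced here.

```python
def resolve_host_rocm_sketch(context, environ, auto_detect):
    """Return (path, source) following the documented host precedence.

    context     -- parsed --additional-context dict
    environ     -- mapping of environment variables (e.g. dict(os.environ))
    auto_detect -- hypothetical callable returning a detected root or None
    """
    # 1. Top-level MAD_ROCM_PATH in additional context wins.
    explicit = context.get("MAD_ROCM_PATH")
    if explicit:
        return explicit, "MAD_ROCM_PATH"
    # 2. Auto-detect, unless disabled with MAD_AUTO_ROCM_PATH=0.
    if environ.get("MAD_AUTO_ROCM_PATH") != "0":
        found = auto_detect()
        if found:
            return found, "auto-detect"
    # 3. Legacy fallback: ROCM_PATH, then /opt/rocm.
    env_path = environ.get("ROCM_PATH")
    if env_path:
        return env_path, "ROCM_PATH"
    return "/opt/rocm", "default"


# Example: auto-detect disabled, so ROCM_PATH is used.
print(resolve_host_rocm_sketch(
    {}, {"MAD_AUTO_ROCM_PATH": "0", "ROCM_PATH": "/srv/rocm"}, lambda: "/opt/rocm-7.0"
))
```

Returning the source alongside the path is a design choice of this sketch; it makes it easy to log which rule produced the result.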

## Build Configuration

4 changes: 2 additions & 2 deletions docs/installation.md
@@ -83,7 +83,7 @@ madengine run --tags dummy \
--additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
```

**Non-default ROCm location:** If ROCm is not under `/opt/rocm` (e.g. [TheRock](https://github.com/ROCm/TheRock) or pip install), set `ROCM_PATH` or use `madengine run --rocm-path /path/to/rocm` so GPU detection and container env use the correct paths.
**Non-default ROCm location (host):** If ROCm is not under `/opt/rocm` (e.g. [TheRock](https://github.com/ROCm/TheRock) or pip install), set `ROCM_PATH` on the **host** or set top-level `MAD_ROCM_PATH` in `--additional-context` so **host** GPU checks (`amd-smi`, etc.) resolve correctly. **In-container** `ROCM_PATH` for Docker workloads is set separately at run (image OCI env, in-image probe, or `docker_env_vars.MAD_ROCM_PATH`); it is not copied from the host. See [Configuration — ROCm path](configuration.md#rocm-path-run-only).

### NVIDIA CUDA

@@ -140,7 +140,7 @@ rocm-smi
ls -la /dev/kfd /dev/dri
```

If ROCm is installed in a non-default path (e.g. Rock or pip), set `export ROCM_PATH=/path/to/rocm` or use `madengine run --rocm-path /path/to/rocm`.
If ROCm is installed in a non-default path on the **host** (e.g. TheRock or pip), set `export ROCM_PATH=/path/to/rocm` or pass `MAD_ROCM_PATH` in `--additional-context` (host validation only; see [ROCm path (run only)](configuration.md#rocm-path-run-only) for in-container behavior).

### MAD Package Not Found

56 changes: 46 additions & 10 deletions docs/usage.md
@@ -309,21 +309,56 @@ madengine run --tags model \
- `gpu_vendor`: "AMD", "NVIDIA"
- `guest_os`: "UBUNTU", "CENTOS"

### ROCm path (non-default installs)
### ROCm path (host and container)

When ROCm is not installed under `/opt/rocm` (e.g. [TheRock](https://github.com/ROCm/TheRock) or pip), set the ROCm root so GPU detection and container environment use the correct paths:
By default, **madengine** auto-detects the **host** ROCm root (apt under `/opt/rocm`, TheRock `rocm-sdk` layout, etc.). Set `MAD_AUTO_ROCM_PATH=0` to disable scanning; madengine then uses only `ROCM_PATH`, falling back to `/opt/rocm`.

**Host** override: set top-level `MAD_ROCM_PATH` in `--additional-context` to tell madengine where host GPU tools live (`rocminfo`, `amd-smi`, etc.).

**In-container** `ROCM_PATH` (AMD Docker runs) is **not** copied from the host. If you do not set `docker_env_vars.MAD_ROCM_PATH` (or a literal `ROCM_PATH` in `docker_env_vars`), madengine sets it at **run** time from, in order: the image OCI `Env` (`ROCM_PATH` / `ROCM_HOME` via `docker image inspect`), an in-container probe (`docker run --rm`), or `/opt/rocm` with a warning. Override explicitly with `{"docker_env_vars": {"MAD_ROCM_PATH": "/path/inside/image"}}` when the image needs a fixed root. Details: [Configuration — ROCm path](configuration.md#rocm-path-run-only).

The two keys are independent — host and container can point to different ROCm installations:

```bash
# Via environment variable
export ROCM_PATH=/path/to/rocm
madengine run --tags model --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
# Override host ROCm root only
madengine run --tags model \
--additional-context '{"MAD_ROCM_PATH": "/path/to/host/rocm", "gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# Via CLI (overrides ROCM_PATH)
madengine run --tags model --rocm-path /path/to/rocm \
--additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
# Override host and container paths independently
madengine run --tags model --additional-context '{
"gpu_vendor": "AMD",
"guest_os": "UBUNTU",
"MAD_ROCM_PATH": "/path/to/host/rocm",
"docker_env_vars": {"MAD_ROCM_PATH": "/opt/rocm"}
}'
```

`--rocm-path` applies only to the **run** command (not build). See [CLI Reference - run](cli-reference.md#run---execute-models).
See [Configuration - ROCm path](configuration.md#rocm-path-run-only).
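
The in-container resolution order described in this section can be sketched as follows. This is a hedged illustration, not the code in `ContainerRunner.run_container`: the function name and parameters are hypothetical, `image_env` stands in for the `Env` data read via `docker image inspect`, `probe` stands in for the `docker run --rm` probe, and checking `MAD_ROCM_PATH` before a literal `ROCM_PATH` within `docker_env_vars` is an assumption the text does not specify.

```python
import warnings


def resolve_container_rocm_sketch(docker_env_vars, image_env, probe):
    """Return the in-container ROCM_PATH following the documented order."""
    # 1. Explicit override in additional context (relative order of the two
    #    keys is an assumption of this sketch).
    explicit = docker_env_vars.get("MAD_ROCM_PATH") or docker_env_vars.get("ROCM_PATH")
    if explicit:
        return explicit
    # 2. Environment baked into the image (OCI config).
    for key in ("ROCM_PATH", "ROCM_HOME"):
        if image_env.get(key):
            return image_env[key]
    # 3. In-image probe; `probe` is a hypothetical callable returning a path or None.
    probed = probe()
    if probed:
        return probed
    # 4. Default, with a warning. The host-resolved path is never consulted.
    warnings.warn("in-container ROCm root not found; defaulting to /opt/rocm")
    return "/opt/rocm"


# Example: no override, image declares ROCM_PATH, so the probe is not needed.
print(resolve_container_rocm_sketch({}, {"ROCM_PATH": "/opt/rocm-7.0"}, lambda: None))
```

Note that the host path does not appear among the inputs, which mirrors the statement that it is not copied into the container.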

### Run phase environment table

At the start of each container run, madengine prints a side-by-side environment summary for the **host** and the **container**:

```
🖥️ RUN PHASE ENVIRONMENT
================================================================================
HOST CONTAINER
──────────────────────────────────────────────────────────────────────────────
GPU Vendor AMD AMD
Installation Type apt install therock
ROCm Root /opt/rocm /opt/python/lib/python3.13/site-packages/_rocm_sdk_devel
ROCm Version 6.4.0 7.13.0a20260415
──────────────────────────────────────────────────────────────────────────────
================================================================================
```

| Field | AMD values | NVIDIA values |
|---|---|---|
| Installation Type | `apt install` (traditional `/opt/rocm`) or `therock` (Python-package layout) | `CUDA toolkit` |
| ROCm / CUDA Root | Resolved via `RocmPathResolver` (host) or `rocm-sdk path --root` (container) | `nvcc` binary location |
| ROCm / CUDA Version | From `amd-smi` / `rocm-sdk version` / `.info/version` | From `nvcc --version` / `nvidia-smi` |

The host column uses the same resolution as top-level `MAD_ROCM_PATH` in additional context. The container column queries the container's own `PATH` at runtime, so it correctly reflects TheRock images where tools live in a Python venv rather than `/opt/rocm/bin/`.

### Deploy to Kubernetes

@@ -616,7 +651,8 @@ madengine build --tags model --clean-docker-cache --verbose
| Variable | Description | Example |
|----------|-------------|---------|
| `MODEL_DIR` | MAD package directory | `/path/to/MAD` |
| `ROCM_PATH` | ROCm installation root (used when `--rocm-path` not set). Use when ROCm is not in `/opt/rocm` (e.g. Rock, pip). | `/path/to/rocm` |
| `ROCM_PATH` | **Host** ROCm root fallback when `MAD_ROCM_PATH` is not set in additional context and auto-detect is disabled or finds nothing. In-container `ROCM_PATH` for Docker is set separately at run; see [ROCm path (host and container)](#rocm-path-host-and-container). | `/path/to/rocm` |
| `MAD_AUTO_ROCM_PATH` | Set to `0` to disable **host** auto-detect (use `ROCM_PATH` then `/opt/rocm` only on the host). Default: on. | `0` |
| `MAD_VERBOSE_CONFIG` | Verbose config logging | `"true"` |
| `MAD_DOCKERHUB_USER` | Docker Hub username | `"myusername"` |
| `MAD_DOCKERHUB_PASSWORD` | Docker Hub password | `"mytoken"` |
9 changes: 0 additions & 9 deletions src/madengine/cli/commands/run.py
@@ -152,13 +152,6 @@ def run(
help="Remove intermediate perf_entry files after run (keeps perf.csv and perf_super files)",
),
] = False,
rocm_path: Annotated[
Optional[str],
typer.Option(
"--rocm-path",
help="ROCm installation path (overrides ROCM_PATH env; default: /opt/rocm). Use when ROCm is not under /opt/rocm (e.g. Rock tar/whl).",
),
] = None,
) -> None:
"""
🚀 Run model containers in distributed scenarios.
@@ -237,7 +230,6 @@ def run(
disable_skip_gpu_arch=disable_skip_gpu_arch,
verbose=verbose,
cleanup_perf=cleanup_perf,
rocm_path=rocm_path,
skip_model_run=skip_model_run,
_separate_phases=True,
)
@@ -342,7 +334,6 @@ def run(
disable_skip_gpu_arch=disable_skip_gpu_arch,
verbose=verbose,
cleanup_perf=cleanup_perf,
rocm_path=rocm_path,
skip_model_run=skip_model_run,
_separate_phases=False, # Full workflow uses .live.log (not .run.live.log)
)
14 changes: 14 additions & 0 deletions src/madengine/cli/validators.py
@@ -124,6 +124,15 @@ def validate_additional_context_structure(context: Dict[str, Any]) -> None:
):
_fail_structure("docker_env_vars", "an object")

dev = context.get("docker_env_vars")
if isinstance(dev, dict) and "MAD_ROCM_PATH" in dev:
v = dev["MAD_ROCM_PATH"]
if not isinstance(v, (str, type(None))):
_fail_structure(
"docker_env_vars['MAD_ROCM_PATH']",
"a string (container ROCm root override)",
)

if "docker_mounts" in context and not isinstance(context["docker_mounts"], dict):
_fail_structure("docker_mounts", "an object")

@@ -170,6 +179,11 @@ def validate_additional_context_structure(context: Dict[str, Any]) -> None:
if "guest_os" in context and not isinstance(context["guest_os"], str):
_fail_structure("guest_os", "a string")

if "MAD_ROCM_PATH" in context and not isinstance(
context["MAD_ROCM_PATH"], (str, type(None))
):
_fail_structure("MAD_ROCM_PATH", "a string (host ROCm root override)")

if "log_error_pattern_scan" in context and not isinstance(
context["log_error_pattern_scan"], (bool, str, int, float, type(None))
):
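
The two checks added in this file accept a string or `null` for each `MAD_ROCM_PATH` key and reject anything else. A standalone sketch of the same rule follows, assuming a plain `ValueError` in place of `_fail_structure` (whose behavior is not shown in this diff); the function name `check_rocm_path_overrides` is hypothetical.

```python
from typing import Any, Dict


def check_rocm_path_overrides(context: Dict[str, Any]) -> None:
    """Raise ValueError if a MAD_ROCM_PATH key is neither a string nor None."""
    # Host override: top-level key.
    if "MAD_ROCM_PATH" in context and not isinstance(
        context["MAD_ROCM_PATH"], (str, type(None))
    ):
        raise ValueError("MAD_ROCM_PATH must be a string (host ROCm root override)")
    # Container override: nested under docker_env_vars, checked only when
    # docker_env_vars is itself an object.
    dev = context.get("docker_env_vars")
    if isinstance(dev, dict) and "MAD_ROCM_PATH" in dev:
        if not isinstance(dev["MAD_ROCM_PATH"], (str, type(None))):
            raise ValueError(
                "docker_env_vars['MAD_ROCM_PATH'] must be a string "
                "(container ROCm root override)"
            )


# A valid context passes silently.
check_rocm_path_overrides(
    {"MAD_ROCM_PATH": "/opt/rocm", "docker_env_vars": {"MAD_ROCM_PATH": None}}
)
```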
15 changes: 9 additions & 6 deletions src/madengine/core/constants.py
@@ -197,17 +197,20 @@ def _get_public_github_rocm_key():


 def get_rocm_path(override=None):
-    """Return ROCm installation root directory.
+    """Return ROCm installation root (legacy, no automatic filesystem scan).

-    Resolution order: override (e.g. from CLI) -> ROCM_PATH env -> default /opt/rocm.
-    Path is normalized to absolute form with no trailing slash.
+    For full resolution (MAD_ROCM_PATH, auto-detect) use
+    :func:`madengine.utils.rocm_path_resolver.resolve_host_rocm_path` in
+    :class:`~madengine.core.context.Context`.
+
+    Resolution: ``override`` -> :envvar:`ROCM_PATH` -> ``/opt/rocm``.

     Args:
         override: Optional path overriding env and default.

     Returns:
         str: Absolute ROCm root path.
     """
-    raw = override if override else os.environ.get("ROCM_PATH", "/opt/rocm")
-    path = os.path.abspath(os.path.expanduser(str(raw).strip()))
-    return path.rstrip(os.sep)
+    from madengine.utils.rocm_path_resolver import get_rocm_path_legacy
+
+    return get_rocm_path_legacy(override)
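
The body of `get_rocm_path_legacy` is not part of this diff. As a minimal sketch, assuming it keeps the behavior of the three lines removed here (override, then `ROCM_PATH`, then `/opt/rocm`, normalized to an absolute path with no trailing separator), the legacy resolution would look like the following. The function name, the `environ` parameter, and the guard that keeps a filesystem-root path from collapsing to an empty string are additions of this sketch.

```python
import os


def legacy_rocm_path_sketch(override=None, environ=None):
    """Resolve override -> ROCM_PATH -> /opt/rocm, without any filesystem scan."""
    env = os.environ if environ is None else environ
    raw = override if override else env.get("ROCM_PATH", "/opt/rocm")
    # Expand "~", make absolute, and drop any trailing separator.
    path = os.path.abspath(os.path.expanduser(str(raw).strip()))
    # Guard (sketch-only): stripping the root path "/" would leave "".
    return path.rstrip(os.sep) or os.sep


print(legacy_rocm_path_sketch("/opt/rocm-6.4/", environ={}))
```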