Complete guide to configuring madengine for various use cases and environments.
```bash
madengine run --tags model \
  --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
```

The same settings can be loaded from a file:

```bash
madengine run --tags model --additional-context-file config.json
```

config.json:

```json
{
"gpu_vendor": "AMD",
"guest_os": "UBUNTU",
"timeout_multiplier": 2.0
}
```

madengine provides sensible defaults for common AMD/Ubuntu workflows:
| Field | Default Value | Customization |
|---|---|---|
| `gpu_vendor` | `AMD` | Set to `NVIDIA` for NVIDIA GPUs |
| `guest_os` | `UBUNTU` | Set to `CENTOS` for CentOS containers |
Defaults are applied during the build command when fields are not explicitly provided:

```bash
# Uses defaults: {"gpu_vendor": "AMD", "guest_os": "UBUNTU"}
madengine build --tags model
# Explicit override
madengine build --tags model \
--additional-context '{"gpu_vendor": "NVIDIA", "guest_os": "CENTOS"}'When defaults are applied, you'll see an informative message:
ℹ️ Using default values for build configuration:
• gpu_vendor: AMD (default)
• guest_os: UBUNTU (default)
💡 To customize, use --additional-context '{"gpu_vendor": "NVIDIA", "guest_os": "CENTOS"}'
```

You can provide one field and let the other default:

```bash
# Override only gpu_vendor (guest_os defaults to UBUNTU)
madengine build --tags model \
--additional-context '{"gpu_vendor": "NVIDIA"}'
# Override only guest_os (gpu_vendor defaults to AMD)
madengine build --tags model \
--additional-context '{"guest_os": "CENTOS"}'For production deployments:
- ✅ DO explicitly specify all configuration values
- ✅ DO use configuration files for reproducibility
- ⚠️ AVOID relying on defaults in automated workflows
The run command does NOT require these values because it can detect the GPU vendor at runtime. Defaults apply only to the build command, where Dockerfile selection requires them.
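A minimal illustration:

```bash
# run: no vendor/OS context required; the GPU vendor is detected at runtime
madengine run --tags model

# build: Dockerfile selection needs vendor/OS, so the defaults above apply
madengine build --tags model
```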
After a successful container run, madengine may scan the run log file for fixed substrings (for example `RuntimeError:`, `OutOfMemoryError`, `Traceback (most recent call last)`). If a match is found, the run can be marked FAILURE even when performance metrics exist. This is intended as a safety net when logs show obvious Python or OOM errors.

Some suites (for example layer unit tests) intentionally print benign `RuntimeError:` text while pytest still passes. In those cases you can disable the scan or narrow what counts as an error.

Keys can be set in `--additional-context` / `--additional-context-file`, or on the model entry in models.json (same keys). Runtime context overrides the model entry when both are set.
| Key | Type | Default | Description |
|---|---|---|---|
| `log_error_pattern_scan` | bool or string/number (coerced) | `true` | If `false`, skip substring-based log failure detection entirely (rely on exit codes and other signals). |
| `log_error_benign_patterns` | array of strings | `[]` | Extra lines to exclude before matching (appended to built-in exclusions such as ROCProf/metrics noise). The model list is merged first, then the context list. |
| `log_error_patterns` | array of strings (non-empty) | (built-in list) | If set, replaces the default pattern list. Use only when you need a custom allowlist of failure substrings. |
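The same keys can be set per model in models.json; a minimal sketch (fields other than the documented keys are illustrative):

```json
{
  "name": "my_unit_test_suite",
  "log_error_benign_patterns": [
    "RuntimeError: benign message printed by passing tests"
  ]
}
```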
Example: disable the scan for a tag (pytest is authoritative):

```bash
madengine run --tags my_unit_test_suite \
--additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU", "log_error_pattern_scan": false}'Example — extra benign substrings (prefer stable strings from real logs):
{
"gpu_vendor": "AMD",
"guest_os": "UBUNTU",
"log_error_benign_patterns": [
"expected benign fragment from workload log"
]
}
```

Disabling the scan does not change performance metric extraction from the log; it only affects the post-hoc grep used to set `has_errors` for status.
`gpu_vendor` (case-insensitive):

- `"AMD"` - AMD ROCm GPUs
- `"NVIDIA"` - NVIDIA CUDA GPUs

`guest_os` (case-insensitive):

- `"UBUNTU"` - Ubuntu Linux
- `"CENTOS"` - CentOS Linux
When ROCm is not installed under /opt/rocm (e.g. TheRock or pip), set the ROCm root so GPU detection and container environment use the correct paths. Use the run command option or environment variable (not JSON context):
- CLI: `madengine run --rocm-path /path/to/rocm ...`
- Environment: `export ROCM_PATH=/path/to/rocm`
Resolution order: `--rocm-path` → `ROCM_PATH` → `/opt/rocm`. This applies only to the run phase; build does not perform GPU detection.
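A worked example of the resolution order (paths illustrative):

```bash
# --rocm-path wins over ROCM_PATH, so /usr/local/rocm is used here
export ROCM_PATH=/opt/rocm-6.1
madengine run --tags model --rocm-path /usr/local/rocm
```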
Use batch manifest files for selective builds with per-model configuration:

```bash
madengine build --batch-manifest batch.json \
--registry my-registry.com \
  --additional-context-file config.json
```

Batch manifest structure (batch.json):

```json
[
{
"model_name": "model1",
"build_new": true,
"registry": "registry1.io",
"registry_image": "namespace/model1"
},
{
"model_name": "model2",
"build_new": false,
"registry": "registry2.io",
"registry_image": "namespace/model2"
}
]
```

Fields:
- `model_name` (string, required): Model tag to include
- `build_new` (boolean, optional, default: `false`): Whether to build this model
  - `true`: Build the model from source
  - `false`: Reference existing image without rebuilding
- `registry` (string, optional): Per-model registry override
- `registry_image` (string, optional): Custom registry image name/namespace
Key Behaviors:

- Only models with `"build_new": true` are built
- Models with `"build_new": false` are included in the output manifest without building
- Per-model `registry` overrides the global `--registry` flag
- `--batch-manifest` and `--tags` cannot be used together (mutually exclusive)
Use Case - CI/CD Incremental Builds:

```json
[
{"model_name": "changed_model", "build_new": true},
{"model_name": "stable_model1", "build_new": false},
{"model_name": "stable_model2", "build_new": false}
]
```

This allows you to rebuild only changed models while keeping references to existing stable images in a single manifest.
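Such a manifest can be generated from source changes; a sketch, assuming a `models/<model_name>/` layout and a fixed model list (both illustrative):

```bash
#!/usr/bin/env bash
# Rebuild only models whose directories changed relative to origin/main
set -euo pipefail
changed=$(git diff --name-only origin/main -- models/ | cut -d/ -f2 | sort -u)
entries=()
for model in model1 model2 model3; do
  if grep -qx "$model" <<<"$changed"; then
    entries+=("{\"model_name\": \"$model\", \"build_new\": true}")
  else
    entries+=("{\"model_name\": \"$model\", \"build_new\": false}")
  fi
done
printf '[%s]\n' "$(IFS=,; echo "${entries[*]}")" > batch.json
madengine build --batch-manifest batch.json
```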
Pass environment variables to containers:

```json
{
"docker_env_vars": {
"HSA_ENABLE_SDMA": "0",
"PYTORCH_TUNABLEOP_ENABLED": "1",
"NCCL_DEBUG": "INFO"
}
}
```

Override the Docker base image:

```json
{
"MAD_CONTAINER_IMAGE": "rocm/pytorch:custom-tag"
}
```

Or override `BASE_DOCKER` in the `FROM` line:

```json
{
"docker_build_arg": {
"BASE_DOCKER": "rocm/pytorch:rocm6.1_ubuntu22.04_py3.10"
}
}
```

Pass build-time variables:

```json
{
"docker_build_arg": {
"ROCM_VERSION": "6.1",
"PYTHON_VERSION": "3.10",
"CUSTOM_ARG": "value"
}
}
```

Mount host directories inside containers:

```json
{
"docker_mounts": {
"/data-inside-container": "/data-on-host",
"/models": "/home/user/models"
}
}
```

Specify GPU and CPU subsets:

```json
{
"docker_gpus": "0,2-4,7",
"docker_cpus": "0-15,32-47"
}
```

Format: comma-separated list with hyphen ranges. For example, `"0,2-4,7"` selects GPUs 0, 2, 3, 4, and 7.

Scale model execution timeouts with a multiplier:

```json
{
"timeout_multiplier": 2.0
}
```

Or use the command-line option:

```bash
madengine run --tags model --timeout 7200
```

Force local data caching:

```json
{
"mirrorlocal": "/tmp/local_mirror"
}
```

Or use the command-line option:

```bash
madengine run --tags model --force-mirror-local /tmp/mirror
```

Minimal Kubernetes configuration:

```json
{
"k8s": {
"gpu_count": 1
}
}
```

This automatically applies (see presets under `src/madengine/deployment/presets/k8s/`):
- Namespace: `default`
- Resource limits based on GPU count
- Image pull policy: `Always` (base default)
- Service account: `default`
- GPU vendor detection from context
- `k8s.secrets` defaults (see below)
Full Kubernetes configuration:

```json
{
"k8s": {
"gpu_count": 2,
"namespace": "ml-team",
"gpu_vendor": "AMD",
"memory": "32Gi",
"memory_limit": "64Gi",
"cpu": "16",
"cpu_limit": "32",
"service_account": "madengine-sa",
"image_pull_policy": "Always",
"ttl_seconds_after_finished": null,
"allow_privileged_profiling": null,
"secrets": {
"strategy": "from_local_credentials",
"image_pull_secret_names": ["my-registry-secret"],
"runtime_secret_name": null
}
}
}
```

K8s Options:
- `gpu_count` - Number of GPUs (required)
- `namespace` - Kubernetes namespace (default: `default`)
- `gpu_vendor` - GPU vendor override (auto-detected from context)
- `memory` - Memory request (default: auto-scaled by GPU count)
- `memory_limit` - Memory limit (default: 2× memory request)
- `cpu` - CPU cores request (default: auto-scaled by GPU count)
- `cpu_limit` - CPU cores limit (default: 2× CPU request)
- `service_account` - Service account name
- `image_pull_policy` - `Always`, `IfNotPresent`, or `Never`
- `ttl_seconds_after_finished` - Optional Job TTL in seconds (auto-delete finished Job); `null` to omit
- `allow_privileged_profiling` - `null` means enable an elevated `securityContext` when tools/profiling are configured; `true`/`false` to force
- `secrets.strategy` - `from_local_credentials` (default): create `Secret` objects from local `credential.json` at deploy time; `existing`: only reference pre-created Secrets (see the sketch after this list); `omit`: no runtime Secret from the client
- `secrets.image_pull_secret_names` - Extra pull secret names (strings) merged with any created from `credential.json` when using `from_local_credentials`
- `secrets.runtime_secret_name` - Required for `existing` (pre-created opaque Secret with key `credential.json`); optional for `omit` if you still mount a runtime Secret
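For clusters where Secrets are pre-provisioned, a minimal sketch using the `existing` strategy (secret names illustrative):

```json
{
  "k8s": {
    "gpu_count": 1,
    "secrets": {
      "strategy": "existing",
      "runtime_secret_name": "madengine-credentials",
      "image_pull_secret_names": ["team-registry-pull"]
    }
  }
}
```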
Multi-node Kubernetes with a distributed launcher:

```json
{
"k8s": {
"gpu_count": 8
},
"distributed": {
"launcher": "torchrun",
"nnodes": 2,
"nproc_per_node": 4
}
}
```

Basic SLURM configuration:

```json
{
"slurm": {
"partition": "gpu",
"gpus_per_node": 4,
"time": "02:00:00"
}
}
```

Full SLURM configuration:

```json
{
"slurm": {
"partition": "gpu",
"account": "research_group",
"qos": "normal",
"gpus_per_node": 8,
"nodes": 2,
"nodelist": "node01,node02",
"time": "24:00:00",
"mem": "64G",
"mail_user": "user@example.com",
"mail_type": "ALL"
}
}
```

Note: `nodelist` is optional; omit it to let SLURM choose nodes. When set, the job runs only on the listed nodes and the node health preflight is skipped.
SLURM Options:
- `partition` - SLURM partition name (required)
- `account` - Billing account
- `qos` - Quality of Service
- `gpus_per_node` - GPUs per node (default: 1)
- `nodes` - Number of nodes (default: 1)
- `nodelist` - Comma-separated node names to run on (e.g. `"node01,node02"`); when set, the job is restricted to these nodes and the automatic node health preflight is skipped
- `time` - Wall time limit `HH:MM:SS` (required)
- `mem` - Memory per node (e.g., `"64G"`)
- `mail_user` - Email for notifications
- `mail_type` - Notification types (`BEGIN`, `END`, `FAIL`, `ALL`)
Multi-node SLURM with a distributed launcher:

```json
{
"slurm": {
"partition": "gpu",
"nodes": 4,
"gpus_per_node": 8,
"time": "48:00:00"
},
"distributed": {
"launcher": "torchrun",
"nnodes": 4,
"nproc_per_node": 8
}
}
```

Distributed launcher configuration:

```json
{
"distributed": {
"launcher": "torchrun",
"nnodes": 2,
"nproc_per_node": 4,
"master_port": 29500
}
}
```

Launcher Options:
- `launcher` - Framework name (required)
- `nnodes` - Number of nodes
- `nproc_per_node` - Processes/GPUs per node
- `master_port` - Master communication port (default: 29500)
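For orientation, the `distributed` block above maps conceptually onto a torchrun invocation along these lines (the exact command madengine generates may differ; `train.py` and the rendezvous variables are illustrative):

```bash
torchrun --nnodes=2 --nproc_per_node=4 \
  --node_rank="$NODE_RANK" --master_addr="$MASTER_ADDR" --master_port=29500 \
  train.py
```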
Supported Launchers:
- `torchrun` - PyTorch DDP/FSDP
- `deepspeed` - ZeRO optimization
- `megatron` - Large transformers (K8s + SLURM)
- `torchtitan` - LLM pre-training
- `vllm` - LLM inference
- `sglang` - Structured generation
See Launchers Guide for details.
torchtitan with parallelism settings via environment variables:

```json
{
"distributed": {
"launcher": "torchtitan",
"nnodes": 4,
"nproc_per_node": 8
},
"env_vars": {
"TORCHTITAN_TENSOR_PARALLEL_SIZE": "8",
"TORCHTITAN_PIPELINE_PARALLEL_SIZE": "4",
"TORCHTITAN_FSDP_ENABLED": "1"
}
}
```

vLLM with tensor parallelism:

```json
{
"distributed": {
"launcher": "vllm",
"nnodes": 2,
"nproc_per_node": 4
},
"vllm": {
"tensor_parallel_size": 4,
"pipeline_parallel_size": 1
}
}
```

Enable a single profiling tool:

```json
{
"tools": [
{"name": "rocprof"}
]
}
```

Customize the tool command and environment:

```json
{
"tools": [
{
"name": "rocprof",
"cmd": "rocprof --timestamp on",
"env_vars": {
"NCCL_DEBUG": "INFO"
}
}
]
}
```

Combine multiple tools:

```json
{
"tools": [
{"name": "rocprof"},
{"name": "miopen_trace"},
{"name": "rocblas_trace"}
]
}
```

Available Tools:
- `rocprof` - GPU profiling
- `rpd` - ROCm Profiler Data
- `rocblas_trace` - rocBLAS library tracing
- `miopen_trace` - MIOpen library tracing
- `tensile_trace` - Tensile library tracing
- `rccl_trace` - RCCL communication tracing
- `gpu_info_power_profiler` - Power consumption profiling
- `gpu_info_vram_profiler` - VRAM usage profiling
See Profiling Guide for details.
Run scripts before and after model execution:

```json
{
"pre_scripts": [
{
"path": "scripts/common/pre_scripts/setup.sh",
"args": "-v"
}
],
"encapsulate_script": "scripts/common/wrapper.sh",
"post_scripts": [
{
"path": "scripts/common/post_scripts/cleanup.sh",
"args": "-r"
}
]
}
```
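A pre-script might look like this minimal sketch (the contents are illustrative; `-v` is the `args` value passed from the config above):

```bash
#!/usr/bin/env bash
# scripts/common/pre_scripts/setup.sh - runs before the model executes
set -euo pipefail
# Enable shell tracing when invoked with -v (the "args" value above)
if [ "${1:-}" = "-v" ]; then
  set -x
fi
echo "preparing environment"
```

Pass arguments to the model execution script:

```json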
{
"model_args": "--model_name_or_path bigscience/bloom --batch_size 32"
}
```

Configure data sources in data.json (MAD package root):

```json
{
"data_sources": {
"model_data": {
"nas": {"path": "/home/datum"},
"minio": {"path": "s3://datasets/datum"},
"aws": {"path": "s3://datasets/datum"}
}
},
"mirrorlocal": "/tmp/local_mirror"
}
```

Configure credentials in credential.json (MAD package root):

```json
{
"dockerhub": {
"username": "your_username",
"password": "your_token",
"repository": "myorg"
},
"AMD_GITHUB": {
"username": "github_username",
"password": "github_token"
},
"MAD_AWS_S3": {
"username": "aws_access_key",
"password": "aws_secret_key"
}
}
```

Or set Docker Hub credentials via environment variables:

```bash
export MAD_DOCKERHUB_USER=myusername
export MAD_DOCKERHUB_PASSWORD=mytoken
export MAD_DOCKERHUB_REPO=myorg
```

For Kubernetes/SLURM deployments, configuration values are resolved in this precedence order:
- CLI overrides (`--additional-context`) - Highest
- User config file (`--additional-context-file`)
- Profile presets (single-gpu/multi-gpu/multi-node)
- GPU vendor presets (AMD/NVIDIA optimizations)
- Base defaults (`k8s/defaults.json`)
- Environment variables
- Built-in fallbacks - Lowest
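As a worked example of this precedence, a key set on the CLI wins over the same key from a file (assuming contexts merge per key as described above):

```bash
# config.json contains {"k8s": {"namespace": "dev"}}; the CLI value below
# takes precedence, so the job deploys to the "prod" namespace.
madengine run --tags model \
  --additional-context-file config.json \
  --additional-context '{"k8s": {"namespace": "prod"}}'
```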
Single-GPU local development:

```json
{
"gpu_vendor": "AMD",
"guest_os": "UBUNTU",
"docker_gpus": "0",
"docker_env_vars": {
"PYTORCH_TUNABLEOP_ENABLED": "1"
}
}
```

Single-GPU Kubernetes development:

```json
{
"k8s": {
"gpu_count": 1,
"namespace": "dev"
}
}
```

Multi-GPU Kubernetes training:

```json
{
"k8s": {
"gpu_count": 4,
"memory": "64Gi",
"cpu": "32"
},
"distributed": {
"launcher": "torchrun",
"nnodes": 1,
"nproc_per_node": 4
}
}
```

Multi-node SLURM training with DeepSpeed:

```json
{
"slurm": {
"partition": "gpu",
"nodes": 8,
"gpus_per_node": 8,
"time": "72:00:00",
"account": "research_proj"
},
"distributed": {
"launcher": "deepspeed",
"nnodes": 8,
"nproc_per_node": 8
}
}
```

Production Kubernetes run with profiling:

```json
{
"k8s": {
"gpu_count": 2,
"namespace": "production",
"memory": "32Gi"
},
"tools": [
{"name": "rocprof"},
{"name": "gpu_info_power_profiler"}
],
"docker_env_vars": {
"NCCL_DEBUG": "INFO",
"PYTORCH_TUNABLEOP_ENABLED": "1"
}
}
```

Validate configuration and enable verbose logging:

```bash
# Verify configuration is valid JSON
python -m json.tool config.json
# Use verbose logging
madengine run --tags model \
--additional-context-file config.json \
  --verbose
```

Check that containers receive environment variables:

```bash
# Check environment variables
env | grep MAD
# Verify Docker receives env vars
docker inspect container_name | grep -A 10 Env
```

madengine auto-detects the GPU vendor if not specified (sketched below):
- Looks for ROCm drivers → AMD
- Looks for CUDA drivers → NVIDIA
- Falls back to configuration or fails
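A rough sketch of this order (the probes are illustrative; madengine's actual detection may differ):

```bash
# Probe for vendor tooling in the same spirit as the detection order above
if command -v rocm-smi >/dev/null 2>&1; then
  echo "gpu_vendor=AMD"
elif command -v nvidia-smi >/dev/null 2>&1; then
  echo "gpu_vendor=NVIDIA"
else
  echo "no GPU tooling found; falling back to configured gpu_vendor"
fi
```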
Override with explicit configuration:

```json
{
"gpu_vendor": "AMD"
}
```

Best practices:

- Use configuration files for complex settings
- Start with minimal configs and add as needed
- Validate JSON syntax before running
- Use environment variables for sensitive data
- Test locally first before deploying
- Enable verbose logging when debugging
- Document custom configurations for team use
See also:

- Usage Guide - Using madengine commands
- Deployment Guide - Deploy to clusters
- Profiling Guide - Performance analysis
- Launchers Guide - Distributed training frameworks