.github/workflows/example_tests.yml (2 changes: 1 addition & 1 deletion)

@@ -86,7 +86,7 @@ jobs:
     uses: ./.github/workflows/_example_tests_runner.yml
     secrets: inherit
     with:
-      docker_image: "nvcr.io/nvidia/nemo:26.02"
+      docker_image: "nvcr.io/nvidia/nemo:26.04"
       example: megatron_bridge
       timeout_minutes: 30
       pip_install_extras: "[hf,puzzletron,dev-test]"
CHANGELOG.rst (1 change: 1 addition & 0 deletions)

@@ -19,6 +19,7 @@ Changelog

 - Add offline DFlash speculative decoding training. Train the draft module from pre-computed base-model hidden states dumped by ``examples/speculative_decoding/collect_hidden_states/compute_hidden_states_hf.py``; base-model transformer layers are deleted after conversion to save memory. Controlled by the auto-derived ``dflash_offline`` flag on ``DFlashConfig`` (derived from ``data_args.offline_data_path``). The dump scripts now share ``collect_hidden_states/common.py`` for aux-layer selection (``--aux-layers eagle|dflash|<list>``) and optional assistant-token ``loss_mask`` for answer-only-loss training.
 - Add ``--cast_mxfp4_to_nvfp4`` flag to ``examples/llm_ptq/hf_ptq.py`` for closed-form, bit-exact MXFP4 → NVFP4 weight conversion. Supports the GPT-OSS family (``openai/gpt-oss-20b``, ``openai/gpt-oss-120b``). See `examples/llm_ptq/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq#mxfp4--nvfp4-cast-for-gpt-oss>`__ for usage.
+- Add support for ``active_params`` (for MoE models) and ``memory_mb`` constraints in Minitron pruning on top of the existing ``params`` constraint. You can also provide multiple constraints. See `examples/pruning/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/pruning>`_ for more details. The underlying utility functions ``mcore_param_count``, ``mcore_memory_footprint_mb``, and ``print_mcore_model_stats`` in ``modelopt.torch.nas.plugins.megatron_model_stats`` are also available for standalone use to compute parameter counts and memory footprints (weights + KV-cache + Mamba state) for any Megatron-Core model.

 0.44 (2026-05-xx)
 ^^^^^^^^^^^^^^^^^
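As a quick illustration of the standalone utilities named in the new changelog entry above, a minimal sketch might look like the following. This is an assumption-laden example, not verified API usage: the keyword arguments to the memory estimate are inferred from how `prune_minitron.py` forwards `batch_size` and `seq_length` for its `memory_mb` constraint, and `model` stands in for an already-built Megatron-Core model.

```python
# Minimal sketch, assuming `model` is an already-built Megatron-Core model and
# that the memory estimator accepts batch_size/seq_length (an assumption
# inferred from the pruning script; check the real signatures before use).
from modelopt.torch.nas.plugins.megatron_model_stats import (
    mcore_memory_footprint_mb,
    mcore_param_count,
    print_mcore_model_stats,
)

total_params = mcore_param_count(model)  # total parameter count
memory_mb = mcore_memory_footprint_mb(model, batch_size=1, seq_length=4096)
print_mcore_model_stats(model)  # weights + KV-cache + Mamba state breakdown
```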
examples/megatron_bridge/README.md (48 changes: 43 additions & 5 deletions)

@@ -16,7 +16,7 @@ This directory contains examples of using Model Optimizer with [NeMo Megatron-Br

 ## Pre-Requisites

-Running these examples requires several additional dependencies (e.g., Megatron-Bridge, Megatron-Core), so we strongly recommend using the NeMo container (e.g., `nvcr.io/nvidia/nemo:26.02`), which has all the dependencies pre-installed.
+Running these examples requires several additional dependencies (e.g., Megatron-Bridge, Megatron-Core), so we strongly recommend using the NeMo container (e.g., `nvcr.io/nvidia/nemo:26.04`), which has all the dependencies pre-installed.

 To get the ModelOpt example scripts, mount your Model-Optimizer repo into the container as follows:

@@ -26,7 +26,7 @@ if [ ! -d "${MODELOPT_DIR}" ]; then
   git clone https://github.com/NVIDIA/Model-Optimizer.git ${MODELOPT_DIR}
 fi

-export DOCKER_IMAGE=nvcr.io/nvidia/nemo:26.02
+export DOCKER_IMAGE=nvcr.io/nvidia/nemo:26.04
 docker run \
   --gpus all \
   --shm-size=16GB \
@@ -49,11 +49,28 @@ hf auth login --token <your token>
 > [!WARNING]
 > Use `python -m pip` instead of `pip` to avoid conflicts with the system-wide installed packages in the NeMo containers. You may also refer to this [doc](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/docker/common/README.md#installing-packages-inside-the-container) on how to correctly install packages in the NeMo containers without breaking the existing torch installation.

+Also install the additional dependencies from the [requirements.txt](./requirements.txt) file.
+
+```bash
+python -m pip install -r requirements.txt
+```
+
 ## Pruning

 This section shows how to prune a HuggingFace model using the Minitron algorithm in the Megatron-Bridge framework. Check out other available pruning algorithms, supported frameworks and models, and a general getting-started guide in the [pruning README](../pruning/README.md).

-Example usage to prune Qwen3-8B to 6B on 2-GPUs (Pipeline Parallelism = 2) while skipping pruning of `num_attention_heads` using following defaults:
+The script supports three NAS-based pruning targets and one manual export mode:
+
+| Mode | Flag | Description |
+| :---: | :---: | :--- |
+| NAS | `--prune_target_params` | Prune to a target total parameter count |
+| NAS | `--prune_target_active_params` | Prune to a target active parameter count (useful for MoE models). For non-MoE models, this is equivalent to `--prune_target_params`. |
+| NAS | `--prune_target_memory_mb` | Prune to a target memory footprint in MB (weights + KV-cache) for a given batch size and sequence length, assuming BF16 precision |
+| Manual | `--prune_export_config` | Prune directly to a specified architecture config (no NAS). Useful if you want to take the top-K candidates and run a short knowledge distillation before selecting the best model. |
+
+Multiple NAS targets can be combined — e.g. `--prune_target_params 6e9 --prune_target_memory_mb 12288` finds the best model with under 6B params and under 12 GB memory footprint at (default) batch size 1 and sequence length 4096, assuming BF16 precision (a concrete combined invocation is sketched after this diff).
+
+**Prune by total parameter count** — prune Qwen3-8B to 6B on 2 GPUs (Pipeline Parallelism = 2) while skipping pruning of `num_attention_heads`, using the following defaults:
 1024 samples from [`nemotron-post-training-dataset-v2`](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) for calibration,
 at most 20% depth (`num_layers`) and 40% width are pruned per prunable hparam (`hidden_size`, `ffn_hidden_size`, ...),
 and the top-10 candidates are evaluated on MMLU (10% of data sampled per subject) to select the best model.
@@ -67,8 +84,29 @@ torchrun --nproc_per_node 2 prune_minitron.py \
     --output_hf_path /tmp/Qwen3-8B-Pruned-6B
 ```

-Example usage for manually pruning to a specific architecture using following defaults:
-1024 samples from [`nemotron-post-training-dataset-v2`](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) for calibration.
+**Prune by active parameter count** — useful for MoE models where most experts are inactive per token (e.g. prune Nemotron-3-Nano-30B-A3B-BF16 (3.6B active params) to 3B active params):
+
+```bash
+torchrun --nproc_per_node 2 prune_minitron.py \
+    --pp_size 2 \
+    --hf_model_name_or_path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
+    --prune_target_active_params 3e9 \
+    --output_hf_path /tmp/Nemotron-3-Nano-30B-A3B-BF16-Pruned-3B-Active
+```
+
+**Prune by memory footprint** — prune to fit a target GPU memory budget (weights + KV-cache at the given sequence length and batch size, assuming BF16):
+
+```bash
+torchrun --nproc_per_node 2 prune_minitron.py \
+    --pp_size 2 \
+    --hf_model_name_or_path Qwen/Qwen3-8B \
+    --prune_target_memory_mb 12288 \
+    --seq_length 4096 \
+    --calib_mbs 1 \
+    --output_hf_path /tmp/Qwen3-8B-Pruned-12GB
+```
+
+**Manual pruning** — prune directly to a specified architecture (no NAS, no score evaluation):

 ```bash
 torchrun --nproc_per_node 2 prune_minitron.py \
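As referenced in the README hunk above, multiple NAS targets can be combined in one run. A concrete combined invocation might look like the sketch below; it merely merges flags the README already documents, and the output path is illustrative:

```bash
torchrun --nproc_per_node 2 prune_minitron.py \
    --pp_size 2 \
    --hf_model_name_or_path Qwen/Qwen3-8B \
    --prune_target_params 6e9 \
    --prune_target_memory_mb 12288 \
    --output_hf_path /tmp/Qwen3-8B-Pruned-6B-12GB
```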
examples/megatron_bridge/prune_minitron.py (128 changes: 82 additions & 46 deletions)

@@ -14,6 +14,11 @@
 # limitations under the License.
 """Example script for pruning a GPT / Mamba model using the Minitron algorithm on a Megatron-Bridge model (load from HF).

+Supports three NAS-based pruning targets (can be combined):
+    --prune_target_params          Total parameter count (e.g. 6e9 for 6B total params)
+    --prune_target_active_params   Active parameter count for MoE models (e.g. 3e9 for 3B active params)
+    --prune_target_memory_mb       Memory footprint in MB (uses --seq_length for KV-cache estimate, assumes BF16)
+
 Example usage to prune Qwen3-8B to 6B on 2 GPUs (Pipeline Parallelism = 2)
 while skipping pruning of num_attention_heads using the following defaults:
 1024 samples from nemotron-post-training-dataset-v2 for calibration,
@@ -47,7 +52,7 @@
 import modelopt.torch.opt as mto
 import modelopt.torch.prune as mtp
 import modelopt.torch.utils.distributed as dist
-from modelopt.torch.utils import get_supported_datasets, num2hrb, print_rank_0, warn_rank_0
+from modelopt.torch.utils import get_supported_datasets, print_rank_0, warn_rank_0
 from modelopt.torch.utils.plugins.mbridge import (
     get_hf_mbridge_calibration_loop,
     load_mbridge_model_from_hf,
@@ -105,7 +110,6 @@ def get_args() -> argparse.Namespace:
     )
     parser.add_argument("--calib_gbs", type=int, default=1, help="Calibration global batch size")
     parser.add_argument("--seq_length", type=int, default=4096)
-
     # Pruning parameters
     parser.add_argument(
         "--prune_intermediate_ckpt",
@@ -117,42 +121,59 @@ def get_args() -> argparse.Namespace:
         ),
     )

-    target_group = parser.add_mutually_exclusive_group(required=True)
-    target_group.add_argument(
+    parser.add_argument(
         "--prune_export_config",
         type=str,
         help=(
             'Target pruned config as JSON e.g., \'{"hidden_size": 512, "ffn_hidden_size": 2048}\'. '
             f"Supported hyperparameters: {mtp.mcore_minitron.SUPPORTED_HPARAMS}. "
-            "Cannot be used with --prune_target_params."
+            "Cannot be combined with NAS-based targets."
         ),
     )
-    target_group.add_argument(
+    parser.add_argument(
         "--prune_target_params",
         type=float,
         help=(
-            "Target parameter count for pruning e.g., 6e9 for pruning to 6B params (total params, not active params). "
-            "Uses Neural Architecture Search (NAS) to find the best pruned model that maximizes the --prune_score_func. "
-            "Cannot be used with --prune_export_config."
+            "Target total parameter count, e.g., 6e9 for 6B params. "
+            "Uses NAS to find the best pruned model that maximizes --prune_score_func. "
+            "Can be combined with --prune_target_active_params and/or --prune_target_memory_mb."
         ),
     )
+    parser.add_argument(
+        "--prune_target_active_params",
+        type=float,
+        help=(
+            "Target active parameter count, e.g., 3e9 for 3B active params (useful for MoE models). "
+            "Uses NAS to find the best pruned model that maximizes --prune_score_func. "
+            "Can be combined with --prune_target_params and/or --prune_target_memory_mb."
+        ),
+    )
+    parser.add_argument(
+        "--prune_target_memory_mb",
+        type=float,
+        help=(
+            "Target memory footprint in MB (weights + KV-cache estimated via seq_length and calib_mbs; assumes BF16). "
+            "Uses NAS to find the best pruned model that maximizes --prune_score_func. "
+            "Can be combined with --prune_target_params and/or --prune_target_active_params."
+        ),
+    )

     parser.add_argument(
         "--prune_score_func",
         type=str,
-        default="mmlu_10pct",
+        default="mmlu_10pct_bs1",
         help=(
-            "Score function to use for NAS-based pruning (--prune_target_params). Only supports MMLU at the moment. "
-            "Format: mmlu_<N>pct where <N> is the percentage of MMLU data to sample per subject "
-            "(e.g. mmlu_10pct for 10%, mmlu_100pct for full eval)."
+            "Score function to use for NAS-based pruning. Only supports MMLU at the moment. "
+            "Format: mmlu_<N>pct_bs<bs> where <N> is the percentage of MMLU data to sample per subject and <bs> is "
+            "the batch size for fast evaluation (default is mmlu_10pct_bs1)."
         ),
     )
     parser.add_argument(
         "--ss_channel_divisor",
         type=int,
         default=None,
         help=(
-            "hidden_size / ffn_hidden_size divisor for NAS-based pruning (--prune_target_params). "
+            "hidden_size / ffn_hidden_size divisor for NAS-based pruning. "
            "Leave as None to use default divisors."
         ),
     )
@@ -162,14 +183,14 @@ def get_args() -> argparse.Namespace:
         default=0.4,
         help=(
             f"Maximum width pruning percentage ({mtp.mcore_minitron.SUPPORTED_HPARAMS - {'num_layers'}}) "
-            "for NAS-based pruning (--prune_target_params)"
+            "for NAS-based pruning"
         ),
     )
     parser.add_argument(
         "--max_depth_pruning",
         type=float,
         default=0.2,
-        help="Maximum depth pruning percentage ('num_layers') for NAS-based pruning (--prune_target_params)",
+        help="Maximum depth pruning percentage ('num_layers') for NAS-based pruning",
     )
     parser.add_argument(
         "--hparams_to_skip",
@@ -178,7 +199,7 @@ def get_args() -> argparse.Namespace:
         default=[],
         choices=mtp.mcore_minitron.SUPPORTED_HPARAMS,
         help=(
-            "Space-separated list of hparams to skip for NAS-based pruning (--prune_target_params) "
+            "Space-separated list of hparams to skip for NAS-based pruning "
             "e.g. don't prune 'num_attention_heads'"
         ),
     )
@@ -187,13 +208,27 @@ def get_args() -> argparse.Namespace:
         type=int,
         default=10,
         help=(
-            "Number of top candidates to consider for NAS-based pruning (--prune_target_params). "
+            "Number of top candidates to consider for NAS-based pruning. "
             "Higher values will take longer to prune but may find a better model."
         ),
     )

     args = parser.parse_args()

+    # Validate pruning target arguments
+    _nas_targets = [
+        args.prune_target_params,
+        args.prune_target_active_params,
+        args.prune_target_memory_mb,
+    ]
+    if args.prune_export_config and any(t is not None for t in _nas_targets):
+        parser.error("--prune_export_config cannot be combined with NAS-based targets.")
+    if not args.prune_export_config and not any(t is not None for t in _nas_targets):
+        parser.error(
+            "At least one of --prune_export_config, --prune_target_params,"
+            " --prune_target_active_params, or --prune_target_memory_mb is required."
+        )
+
     # Post-process arguments
     if args.prune_intermediate_ckpt is None:
         if args.output_megatron_path:
@@ -250,11 +285,6 @@ def main(args: argparse.Namespace):
         init_model_parallel=True,
         moe_grouped_gemm=False,
     )
-    print_rank_0(f"\nPruning model (showing PP rank0): {unwrapped_model}")
-    print_rank_0(
-        f"Original model params: {num2hrb(mtp.mcore_minitron.get_mcore_param_count(unwrapped_model))}"
-    )
-
     forward_loop = get_hf_mbridge_calibration_loop(
         model=model,
         provider=provider,
@@ -271,10 +301,20 @@ def main(args: argparse.Namespace):
         "forward_loop": forward_loop,
         "checkpoint": args.prune_intermediate_ckpt,
     }
-    if args.prune_target_params is not None:
-        # Restrict search space to a smaller set of candidates
-        # Allow more choices for MoE FFN as they are generally smaller
-        # NOTE: You can reduce the divisors and increase config['top_k'] to potentially find a better model.
+    if args.prune_export_config is not None:
+        # Less restrictive search space for manual pruning
+        ss_config = mtp.mcore_minitron.get_mcore_minitron_config(
+            hidden_size_divisor=64,
+            ffn_hidden_size_divisor=64,
+            mamba_head_dim_divisor=8,
+            num_moe_experts_divisor=8,
+            num_layers_divisor=1,
+        )
+        pruning_constraints = {"export_config": args.prune_export_config}
+    else:
+        # NAS-based pruning: restrict search space to a smaller set of candidates.
+        # Allow more choices for MoE FFN as they are generally smaller.
+        # NOTE: Reduce divisors and increase config['top_k'] to potentially find a better model.
         hidden_size_divisor = args.ss_channel_divisor if args.ss_channel_divisor else 256
         ffn_hidden_size_divisor = (
             args.ss_channel_divisor
@@ -290,40 +330,40 @@ def main(args: argparse.Namespace):
         )
         print_rank_0(f"Using search space config: {ss_config}")

-        pruning_constraints = {"params": args.prune_target_params}
+        pruning_constraints = {}
+        if args.prune_target_params is not None:
+            pruning_constraints["params"] = args.prune_target_params
+        if args.prune_target_active_params is not None:
+            pruning_constraints["active_params"] = args.prune_target_active_params
+        if args.prune_target_memory_mb is not None:
+            pruning_constraints["memory_mb"] = args.prune_target_memory_mb

         print_rank_0(
             f"Using NAS-based automatic pruning with score function: {args.prune_score_func}. "
             "You can change this to be any other metric you want to maximize (e.g. negative validation loss)."
         )

-        match = re.fullmatch(r"mmlu_(\d+)pct", args.prune_score_func)
+        match = re.fullmatch(r"mmlu_(\d+)pct_bs(\d+)", args.prune_score_func)
         if not match:
             raise ValueError(
-                f"Invalid score function: {args.prune_score_func}. Expected format: mmlu_<N>pct (e.g. mmlu_10pct)"
+                f"Invalid score function: {args.prune_score_func}. Expected format: mmlu_<N>pct_bs<bs>"
             )
         mmlu_frac = float(match.group(1)) / 100.0
+        batch_size = int(match.group(2))

         def score_func(m):
             return megatron_mmlu(
-                m, tokenizer, few_shots=0, fraction=mmlu_frac, batch_size=args.calib_mbs
+                m, tokenizer, few_shots=0, fraction=mmlu_frac, batch_size=batch_size
             )

         pruning_config["score_func"] = score_func
         pruning_config["max_width_pruning"] = args.max_width_pruning
         pruning_config["max_depth_pruning"] = args.max_depth_pruning
         pruning_config["hparams_to_skip"] = args.hparams_to_skip
         pruning_config["top_k"] = args.top_k
-    elif args.prune_export_config is not None:
-        # Less restrictive search space for manual pruning
-        ss_config = mtp.mcore_minitron.get_mcore_minitron_config(
-            hidden_size_divisor=64,
-            ffn_hidden_size_divisor=64,
-            mamba_head_dim_divisor=8,
-            num_moe_experts_divisor=8,
-            num_layers_divisor=1,
-        )
-
-        pruning_constraints = {"export_config": args.prune_export_config}
+    # memory_mb constraint requires batch_size and seq_length
+    pruning_config["batch_size"] = args.calib_mbs
+    pruning_config["seq_length"] = args.seq_length
+    print_rank_0(f"Pruning constraints: {pruning_constraints}")

     unwrapped_model, pruning_scores = mtp.prune( # in-place pruning
@@ -343,10 +383,6 @@ def score_func(m):
         else "hybrid_layer_pattern"
     )
     setattr(provider, hybrid_key, getattr(unwrapped_model, hybrid_key))
-    print_rank_0(f"\nPruned model (showing PP rank0): {unwrapped_model}")
-    print_rank_0(
-        f"Pruned model params: {num2hrb(mtp.mcore_minitron.get_mcore_param_count(unwrapped_model))}"
-    )

     if args.output_megatron_path is not None:
         print_rank_0(
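The diff above changes the score-function format from `mmlu_<N>pct` to `mmlu_<N>pct_bs<bs>`. As a sanity check for custom score-function strings, this standalone snippet mirrors the parsing logic shown in the hunk:

```python
import re

# Mirrors the regex in prune_minitron.py for the new default "mmlu_10pct_bs1".
match = re.fullmatch(r"mmlu_(\d+)pct_bs(\d+)", "mmlu_10pct_bs1")
assert match is not None, "expected format mmlu_<N>pct_bs<bs>"
mmlu_frac = float(match.group(1)) / 100.0  # 0.1 -> sample 10% of MMLU per subject
batch_size = int(match.group(2))           # 1   -> evaluation batch size
print(mmlu_frac, batch_size)  # prints: 0.1 1
```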
examples/megatron_bridge/requirements.txt (2 changes: 2 additions & 0 deletions)

@@ -0,0 +1,2 @@
+# Saving some pruned models (e.g. Nemotron-3-Nano-30B-A3B-BF16) has issues with transformers>=5.0
+transformers<5.0