Megatron, Transformed! 😎

A Hands-on Megatron-LM Tutorial on Replicating Empirical Trends in Distributed Training and Model Parallelism

Megatron-LM is, without question, one of the most impressive frameworks advancing and enabling ultra-large model training. But just how many arguments are there in pretrain_gpt.py? Over 500. 😅

How many GPUs do we need to get started? Which arguments should we set, and to what? Which axis of parallelism should we scale first? Why am I getting out-of-memory (OOM) errors all the time? Okay, it’s finally running… but is it even correct? Is the performance as expected?

This tutorial series is written to address those pain points. It provides a set of curated, ready-to-run (and ready-to-hack) experiments designed to reduce the friction of getting started with Megatron-LM. Each tutorial explains the core concept succinctly and ablates one parallelism strategy at a time, progressing through DP → ZeRO → TP → SP → CP → PP → VPP → EP. Wherever possible, we align with the main paper to reproduce the reported scaling trends, verifying both correctness and understanding.
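For orientation, the cheat sheet below maps each of those axes to the pretrain_gpt.py argument that controls it. This is our own summary rather than something lifted from the scripts, and flag names can drift between Megatron-LM versions, so double-check against make show-arguments in your build:

# Hedged cheat sheet: parallelism axis -> Megatron-LM argument (names may vary across versions)
# DP   : implicit; data-parallel size = world_size / (TP x PP x CP)
# ZeRO : --use-distributed-optimizer                 # ZeRO-style optimizer-state sharding (the zero2 targets)
# TP   : --tensor-model-parallel-size N
# SP   : --sequence-parallel                         # layered on top of TP
# CP   : --context-parallel-size N
# PP   : --pipeline-model-parallel-size N
# VPP  : --num-layers-per-virtual-pipeline-stage N   # interleaved pipeline schedule
# EP   : --expert-model-parallel-size N              # MoE expert parallelism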

All experiments are designed to run on a single node with 8×H100 80 GB GPUs. Performance and memory metrics are meticulously tabulated for your reference and verification. The goal is to make Megatron-LM more accessible, reproducible, and tinker-friendly. Let's get started!

Explore:


Setup and Run

We rely on Megatron-LM's out-of-the-box pretrain_gpt.py and use bash scripts to wrap it with different parallelism strategies. We also use Makefiles extensively to organize and manage the experiments.
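To make the wrapping concrete, here is a minimal, hypothetical sketch of the general launch pattern: torchrun starting pretrain_gpt.py with a few parallelism and batching knobs. It is not one of this repo's actual scripts; the model, tokenizer, dataset, and logging arguments are left to the caller:

#!/usr/bin/env bash
# Hypothetical wrapper sketch -- NOT one of this repo's actual scripts.
# The real wrappers pass many more of pretrain_gpt.py's 500+ arguments.
set -euo pipefail

GPUS_PER_NODE=8   # single 8xH100 node, as assumed throughout the tutorials

torchrun --nproc_per_node="${GPUS_PER_NODE}" pretrain_gpt.py \
    --tensor-model-parallel-size 2 \
    --pipeline-model-parallel-size 4 \
    --micro-batch-size 1 \
    --global-batch-size 8 \
    --train-iters 100 \
    --bf16 \
    "$@"   # model shape, tokenizer, dataset, and logging flags go here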

The easiest way to get started is to use the prebuilt Docker image:

docker run -d --gpus all -it --rm \
  --network=host --ipc=host \
  --shm-size=16g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  vuiseng9/megatron-tutorials
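Note that the command above starts the container detached (-d); to get a shell inside it (the container id or name will differ on your machine):

docker ps                          # find the running container's id or name
docker exec -it <container id> bash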

Or build using docker/Dockerfile.
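If you would rather build the image yourself, a standard build along these lines should work; the tag is just an example, and the build context is assumed to be the repository root:

git clone https://github.com/vuiseng9/megatron-tutorials
cd megatron-tutorials
docker build -f docker/Dockerfile -t megatron-tutorials:local .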

How to run? Just make <id> + tab completion

  • The Docker entrypoint drops you into the working directory /workspace/megatron-lm/examples/gpt3.
  • Each experiment is defined as a Makefile target prefixed with a unique id. The intent is to cut down the number of bash scripts, steps, and arguments so that runs are low-friction to reproduce: just type make, then the id, and finally a tab to complete the target, e.g. make 101<tab> expands to make 101-gpt2xl-dp1-gbs1-bf16. Each result in our tables also comes with its corresponding unique id.
  • Metrics are printed to standard output, which is also logged to ./outdir/<experiment label>/logs.txt. To see GPU memory usage, run monitor-gpu in a separate terminal. Runs are preconfigured to stop after 100 training steps. See the example session below.
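Putting it together, a typical session looks roughly like this; the target id is just one example from the tables, and the grep pattern assumes Megatron-LM's usual per-iteration log line:

# Terminal 1: launch an experiment (tab-completable); it stops after 100 steps
make 104-gpt2xl-dp4-gbs4

# Terminal 2: watch GPU memory while it runs
monitor-gpu                        # or a generic fallback: watch -n 1 nvidia-smi

# Afterwards: pull per-iteration timing out of the experiment's log
grep "elapsed time per iteration" ./outdir/<experiment label>/logs.txt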

System requirements: our results are collected on a single node with 8x H100 80GB SXM5 (NVLink) GPUs. 8x A100 80GB GPUs should show a similar trend. PCIe GPUs will run, but impractically slowly.

$ make 
101-gpt2xl-dp1-gbs1-bf16                 318-cp8-gpt2-1.2B-gbs8-len4096-ag
104-gpt2xl-dp4-gbs4                      328-cp8-gpt2-1.2B-gbs8-len4096-a2a
111-gpt2xl-dp1-fit-80GB                  338-cp8-gpt2-1.2B-gbs8-len16k
112-gpt2xl-dp1-fit-80GB-GA4              348-cp8-gpt2-1.2B-gbs8-len16k-ag
114-gpt2xl-dp4-fit-4x80GB                358-cp8-gpt2-1.2B-gbs8-len16k-a2a
115-gpt2xl-dp4-gbs60-zero2               401-gpt2-8.3B-pp8-m1
118-gpt2xl-dp8-fit-8x80GB                402-gpt2-8.3B-pp8-m2
121-gpt2xl-dp1-gbs16-oom                 404-gpt2-8.3B-pp8-m4
122-gpt2xl-dp2-gbs32-zero2               408-gpt2-8.3B-pp8-m8
124-gpt2xl-dp4-gbs64-zero2               416-gpt2-8.3B-pp8-m16
128-gpt2xl-dp8-gbs128-zero2              420-gpt2-8.3B-tpsp8
129-gpt2xl-dp8-gbs168-zero2              424-gpt2-8.3B-tpsp2-pp4-m4
211-weak-scale-tp1-gpt2-1.2B-paper       432-gpt2-8.3B-pp8-m32
212-weak-scale-tp2-gpt2-2.5B-paper       438-gpt2-8.3B-pp8-vpp3-m8
214-weak-scale-tp4-gpt2-4.2B-paper       443-gpt2-8.3B-tpsp2-pp4-vpp3-m4
218-weak-scale-tp8-gpt2-8.3B-paper       446-gpt2-8.3B-tpsp2-pp4-vpp6-m4
221-weak-scale-tp1-gpt2-1.2B-gbs20       449-gpt2-8.3B-tpsp2-pp4-vpp9-m4
222-weak-scale-tp2-gpt2-2.5B-gbs20       458-gpt2-8.3B-tpsp2-pp4-vpp18-m4
224-weak-scale-tp4-gpt2-4.2B-gbs20       498-gpt2-8.3B-pp8-vpp9-m8
228-weak-scale-tp8-gpt2-8.3B-gbs20       500-olmoe-dp8
231-strong-scale-gpt2-1.2B-tp1-paper     501-olmoe-dp8-zero2
232-strong-scale-gpt2-1.2B-tp2-paper     502-olmoe-dp8-zero2-ep8
234-strong-scale-gpt2-1.2B-tp4-paper     503-olmoe-2xE-dp8-zero2
238-strong-scale-gpt2-1.2B-tp8-paper     504-olmoe-2xE-dp8-zero2-ep8
241-strong-scale-gpt2-1.2B-tp1-gbs20     505-olmoe-4xE-dp8-zero2
242-strong-scale-gpt2-1.2B-tp2-gbs20     506-olmoe-4xE-dp8-zero2-ep8
244-strong-scale-gpt2-1.2B-tp4-gbs20     511-olmoe-4xE-dp8-zero2-ep4-etp2
248-strong-scale-gpt2-1.2B-tp8-gbs20     512-olmoe-4xE-dp8-zero2-ep2-etp4
281-gpt-22B-tp8-gbs4-len2048-oom         count-arguments
282-gpt-22B-tp8-gbs4-len2048-sp          how-to-recompute-activation
283-gpt-22B-tp8-gbs4-len2048-ra          install-dependencies
300-cp1-gpt2-1.2B-gbs8-len4096-ra        prepare-ds-openwebtext-10k
301-cp1-gpt2-1.2B-gbs8-len4096-oom       profile-282-gpt-22B-tp8-gbs4-len2048-sp
302-cp2-gpt2-1.2B-gbs8-len4096           profile-283-gpt-22B-tp8-gbs4-len2048-ra
304-cp4-gpt2-1.2B-gbs8-len4096           show-arguments
308-cp8-gpt2-1.2B-gbs8-len4096           
Citation

@misc{chua2025megatrontransformed,
  title        = {Megatron, Transformed! A Hands-on Megatron-LM Tutorial on Replicating Empirical Trends in Distributed Training and Model Parallelism},
  author       = {Chua, Vui Seng},
  year         = {2025},
  url          = {https://github.com/vuiseng9/megatron-tutorials},
}

