diff --git a/docs/get-started/quickstart.md b/docs/get-started/quickstart.md index 61868e7877c..28514e3b2a3 100644 --- a/docs/get-started/quickstart.md +++ b/docs/get-started/quickstart.md @@ -33,7 +33,7 @@ torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py ```bash # 8 GPUs, FP8 precision, mock data -./examples/llama/train_llama3_8b_fp8.sh +./examples/open_models/llama/train_llama3_8b_fp8.sh ``` ## Data Preparation diff --git a/docs/models/llms.md b/docs/models/llms.md index 6789a4c551c..3402c522713 100644 --- a/docs/models/llms.md +++ b/docs/models/llms.md @@ -34,12 +34,10 @@ See the [Megatron Bridge supported models list](https://github.com/NVIDIA-NeMo/M ## Example Scripts Training examples for these models can be found in the `examples/` directory: -- `examples/gpt3/` - GPT-3 training scripts -- `examples/llama/` - LLaMA training scripts -- `examples/mixtral/` - Mixtral MoE training -- `examples/mamba/` - Mamba training scripts -- `examples/bert/` - BERT training scripts -- `examples/t5/` - T5 training scripts +- `examples/open_models/gpt3/` - GPT-3 training scripts +- `examples/open_models/llama/` - LLaMA training scripts +- `examples/open_models/mamba/` - Mamba training scripts +- `examples/open_models/t5/` - T5 training scripts ## Model Implementation diff --git a/docs/models/multimodal.md b/docs/models/multimodal.md index 66ed8ccd9cb..11475495894 100644 --- a/docs/models/multimodal.md +++ b/docs/models/multimodal.md @@ -14,7 +14,7 @@ Megatron Core supports multimodal models that combine language with vision, audi - Unified embedding space across modalities - Support for both vision-language and audio-vision-language models -See [examples/mimo](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/mimo) for training scripts and examples. +See [examples/open_models/mimo](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/open_models/mimo) for training scripts and examples. 
## Vision-Language Models @@ -52,7 +52,7 @@ For multimodal diffusion models (image generation, text-to-image, etc.), see [Ne Multimodal training examples can be found in the following directories: **MIMO Framework:** -- `examples/mimo/` - Multimodal In/Out training with support for vision-language and audio-vision-language models +- `examples/open_models/mimo/` - Multimodal In/Out training with support for vision-language and audio-vision-language models **Specific Multimodal Models:** - `examples/multimodal/` - LLaVA-style training with Mistral + CLIP diff --git a/docs/user-guide/training-examples.md b/docs/user-guide/training-examples.md index 2824c608c36..2295fd16b40 100644 --- a/docs/user-guide/training-examples.md +++ b/docs/user-guide/training-examples.md @@ -24,7 +24,7 @@ This example: Train LLaMA-3 8B model with FP8 mixed precision on 8 GPUs: ```bash -./examples/llama/train_llama3_8b_fp8.sh +./examples/open_models/llama/train_llama3_8b_fp8.sh ``` **Configuration:** diff --git a/examples/bert/README.md b/examples/bert/README.md deleted file mode 100644 index 6c1fe95bf06..00000000000 --- a/examples/bert/README.md +++ /dev/null @@ -1,53 +0,0 @@ -# BERT MODEL - -## Table of contents -- [1. Training Setup](#1-training-setup) -- [2. Configurations](#2-configurations) - -## 1. Training setup - - -To run the model using a docker container run it as follows -``` -PYTORCH_IMAGE=nvcr.io/nvidia/pytorch:24.01-py3 -CHECKPOINT_PATH="" # -TENSORBOARD_LOGS_PATH=""# -VOCAB_FILE="" #//bert-vocab.txt -DATA_PATH="" #_text_document - -docker run \ - --gpus=all \ - --ipc=host \ - --workdir /workspace/megatron-lm \ - -v /path/to/data:/path/to/data \ - -v /path/to/megatron-lm:/workspace/megatron-lm \ - megatron-lm nvcr.io/nvidia/pytorch:24.01-py3 \ - bash examples/bert/train_bert_340m_distributed.sh $CHECKPOINT_PATH $TENSORBOARD_LOGS_PATH $VOCAB_FILE $DATA_PATH " - -``` -NOTE: Depending on the environment you are running it the above command might like slightly different. 
- - -## 2. Configurations - -The example in this folder shows you how to run 340m large model. There are other configs you could run as well - -### 4B -``` - --num-layers 48 \ - --hidden-size 2560 \ - --num-attention-heads 32 \ - --tensor-model-parallel-size 1 \ - --pipeline-model-parallel-size 1 \ - -``` - -### 20B -``` - --num-layers 48 \ - --hidden-size 6144 \ - --num-attention-heads 96 \ - --tensor-model-parallel-size 4 \ - --pipeline-model-parallel-size 4 \ - -``` \ No newline at end of file diff --git a/examples/bert/train_bert_340m_distributed.sh b/examples/bert/train_bert_340m_distributed.sh deleted file mode 100644 index f0d9c87c8bf..00000000000 --- a/examples/bert/train_bert_340m_distributed.sh +++ /dev/null @@ -1,79 +0,0 @@ -#!/bin/bash - -# Runs the "340M" parameter model (Bert - Large) - -export CUDA_DEVICE_MAX_CONNECTIONS=1 - -GPUS_PER_NODE=8 -# Change for multinode config -MASTER_ADDR=localhost -MASTER_PORT=6000 -NUM_NODES=1 -NODE_RANK=0 -WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES)) - -CHECKPOINT_PATH=$1 # -TENSORBOARD_LOGS_PATH=$2 # -VOCAB_FILE=$3 #/bert-vocab.json -DATA_PATH=$4 #_text_document - -DISTRIBUTED_ARGS=( - --nproc_per_node $GPUS_PER_NODE - --nnodes $NUM_NODES - --master_addr $MASTER_ADDR - --master_port $MASTER_PORT -) - -BERT_MODEL_ARGS=( - --num-layers 24 - --hidden-size 1024 - --num-attention-heads 16 - --seq-length 512 - --max-position-embeddings 512 - --attention-backend auto # Can use (flash/fused/unfused/local) -) - -TRAINING_ARGS=( - --micro-batch-size 4 - --global-batch-size 32 - --train-iters 1000000 - --weight-decay 1e-2 - --clip-grad 1.0 - --fp16 - --lr 0.0001 - --lr-decay-iters 990000 - --lr-decay-style linear - --min-lr 1.0e-5 - --weight-decay 1e-2 - --lr-warmup-fraction .01 - --clip-grad 1.0 -) - -MODEL_PARALLEL_ARGS=( - --tensor-model-parallel-size 8 - --pipeline-model-parallel-size 16 -) - -DATA_ARGS=( - --data-path $DATA_PATH - --vocab-file $VOCAB_FILE - --split 949,50,1 -) - -EVAL_AND_LOGGING_ARGS=( - --log-interval 100 - 
--save-interval 10000 - --eval-interval 1000 - --save $CHECKPOINT_PATH - --load $CHECKPOINT_PATH - --eval-iters 10 - --tensorboard-dir $TENSORBOARD_LOGS_PATH -) - -torchrun ${DISTRIBUTED_ARGS[@]} pretrain_bert.py \ - ${BERT_MODEL_ARGS[@]} \ - ${TRAINING_ARGS[@]} \ - ${MODEL_PARALLEL_ARGS[@]} \ - ${DATA_ARGS[@]} \ - ${EVAL_AND_LOGGING_ARGS[@]} - \ No newline at end of file diff --git a/examples/mixtral/README.md b/examples/mixtral/README.md deleted file mode 100644 index e85eccd6efd..00000000000 --- a/examples/mixtral/README.md +++ /dev/null @@ -1,132 +0,0 @@ -# Mixtral 8x7B Model Inference and Finetuning - -## Download Mixtral 8x7B Checkpoints -Download Mixtral 8x7B HF format checkpoint from [HF-hub](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/) - -Or you can simply run this following script to download Mixtral 8x7B into a specific folder. -```python -from huggingface_hub import snapshot_download -SAVED_DIR = "" # Specify the saved directory -# Download HF checkpoints -snapshot_download(repo_id="mistralai/Mixtral-8x7B-v0.1", ignore_patterns=["*.pt"], local_dir=SAVED_DIR, local_dir_use_symlinks=False) -``` - -## Convert Mixtral 8x7B checkpoints from HF to MCore -The HF checkpoints can be converted to Megatron format by using the provided checkpoint converter for HF format. -The target model parallel size(e.g. TP,PP,EP) should be specified. - -Currently the converter doesn't support distributed checkpointing yet, so each different parallel config requires a specific checkpoint. 
-- For training, the recommended model parallel config is TP1EP8PP4 -- For inference, the recommended model parallel config is TP1EP1PP2 - -``` -TOKENIZER_MODEL=/workspace/checkpoints/mixtral-hf/tokenizer.model -MEGATRON_PATH="/workspace/megatron-lm" -export PYTHONPATH=$MEGATRON_PATH:$PYTHONPATH -export CUDA_DEVICE_MAX_CONNECTIONS=1 - -TARGET_TP_SIZE="" -TARGET_EP_SIZE="" -TARGET_PP_SIZE="" - -HF_FORMAT_DIR=/workspace/checkpoints/mixtral-hf -MEGATRON_FORMAT_DIR=/workspace/checkpoints/mixtral-mcore-TP${TARGET_TP_SIZE}PP${TARGET_PP_SIZE}EP${TARGET_EP_SIZE} - -python tools/checkpoint/convert.py \ ---model-type GPT \ ---loader loader_mixtral_hf \ ---saver mcore \ ---target-tensor-parallel-size ${TARGET_TP_SIZE} \ ---target-pipeline-parallel-size ${TARGET_PP_SIZE} \ ---target-expert-parallel-size ${TARGET_EP_SIZE} \ ---load-dir ${HF_FORMAT_DIR} \ ---save-dir ${MEGATRON_FORMAT_DIR} \ ---tokenizer-model ${TOKENIZER_MODEL} -``` - -## Text generation with Mixtral 8x7B -Inference with Mixtral 8x7B requires at least 2 GPUS, such that a distributed checkpoint with EP>=2 or PP>=2 converted with above script is needed. - -The Megatron-LM have included a simple REST server to use for text generation in `tools/run_text_generation_server.py`, launch it with the following script: -``` -#!/bin/bash -# This example will start serving the Mixtral 8x7B model. 
-DISTRIBUTED_ARGS="--nproc_per_node 2 \ - --nnodes 1 \ - --node_rank 0 \ - --master_addr localhost \ - --master_port 6000" - -CHECKPOINT= -TOKENIZER_MODEL= - -export CUDA_DEVICE_MAX_CONNECTIONS=1 - -pip install flask-restful - -torchrun $DISTRIBUTED_ARGS tools/run_text_generation_server.py \ - --tensor-model-parallel-size 1 \ - --pipeline-model-parallel-size 2 \ - --expert-model-parallel-size 1 \ - --load ${CHECKPOINT} \ - --tokenizer-type Llama2Tokenizer \ - --tokenizer-model $TOKENIZER_MODEL \ - --use-mcore-models \ - --max-position-embeddings 32768 \ - --num-layers 32 \ - --hidden-size 4096 \ - --ffn-hidden-size 14336 \ - --num-attention-heads 32 \ - --normalization RMSNorm \ - --disable-bias-linear \ - --position-embedding-type rope \ - --no-position-embedding \ - --swiglu \ - --untie-embeddings-and-output-weights \ - --group-query-attention \ - --num-query-groups 8 \ - --bf16 \ - --micro-batch-size 1 \ - --seq-length 1024 \ - --seed 42 \ - --num-experts 8 \ - --moe-router-topk 2 \ - --moe-token-dispatcher-type alltoall \ - --moe-grouped-gemm \ - --mock-data \ - --rotary-base 1000000 -``` - -Once the server is running you can use `tools/text_generation_cli.py` to query it, it takes one argument which is the host the server is running on. 
- -``` -python tools/text_generation_cli.py localhost:5000 -``` - - -## Finetuning from pretrained Mixtral 8x7B -To finetuning pretrained Mixtral 8x7B, use the following scripts: - - -```bash -PYTORCH_IMAGE=nvcr.io/nvidia/pytorch:24.04-py3 -CHECKPOINT_PATH="" # Speicfy path to checkpoint dir -TOKENIZER_MODEL="" # Specify path to tokenizer.model -DATA_PATH="" # Specify path to data - -docker run \ - --gpus=all \ - --ipc=host \ - --workdir /workspace/megatron-lm \ - -v /path/to/data:/path/to/data \ - -v /path/to/megatron-lm:/workspace/megatron-lm \ - $PYTORCH_IMAGE \ - bash examples/mixtral/train_mixtral_8x7b_distributed.sh $CHECKPOINT_PATH $TOKENIZER_MODEL $DATA_PATH -``` - -The above functionality also applys to Mixtral 8x22B actually, you should set the model config (including hidden_size/head_num/num_layers/ffn_hidden_size) properly according to the original [config](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1/blob/main/config.json). - -## Acknowledgements -Contributors outside NVIDIA for the huggingface converter and example of Mixtral models in Megatron-Core: -- Peng Li -- Jun Huang diff --git a/examples/mixtral/train_mixtral_8x7b_distributed.sh b/examples/mixtral/train_mixtral_8x7b_distributed.sh deleted file mode 100644 index ed44d60f5c0..00000000000 --- a/examples/mixtral/train_mixtral_8x7b_distributed.sh +++ /dev/null @@ -1,116 +0,0 @@ -#!/bin/bash - -# Runs Mixtral 8x7B model - -export CUDA_DEVICE_MAX_CONNECTIONS=1 - -GPUS_PER_NODE=8 -# Change for multinode config -MASTER_ADDR=${MASTER_ADDR:-"localhost"} -MASTER_PORT=${MASTER_PORT:-"6000"} -NNODES=${SLURM_NNODES:-"1"} -NODE_RANK=${RANK:-"0"} -WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES)) - -CHECKPOINT_PATH=$1 -TOKENIZER_MODEL=$2 -DATA_PATH=$3 - -DISTRIBUTED_ARGS=( - --nproc_per_node $GPUS_PER_NODE - --nnodes $NNODES - --node_rank $NODE_RANK - --master_addr $MASTER_ADDR - --master_port $MASTER_PORT -) - -MODEL_ARGS=( - --use-mcore-models - --disable-bias-linear - --seq-length 4096 - 
--max-position-embeddings 32768 - --num-layers 32 - --hidden-size 4096 - --ffn-hidden-size 14336 - --num-attention-heads 32 - --init-method-std 0.01 - --attention-dropout 0.0 - --hidden-dropout 0.0 - --normalization RMSNorm - --position-embedding-type rope - --swiglu - --untie-embeddings-and-output-weights - --group-query-attention - --num-query-groups 8 - --no-masked-softmax-fusion - --no-position-embedding - --rotary-base 1000000 -) - -MOE_ARGS=( - --num-experts 8 - --moe-router-topk 2 - --moe-router-load-balancing-type aux_loss - --moe-aux-loss-coeff 1e-2 - --moe-grouped-gemm - --moe-token-dispatcher-type alltoall - --overlap-param-gather - --overlap-grad-reduce -) - -DATA_ARGS=( - --tokenizer-type Llama2Tokenizer - --tokenizer-model ${TOKENIZER_MODEL} - --data-path $DATA_PATH - --split 99990,8,2 -) - -TRAINING_ARGS=( - --micro-batch-size 1 - --global-batch-size 256 - --lr 1e-4 - --train-iters 500000 - --lr-decay-iters 320000 - --lr-decay-style cosine - --min-lr 1.0e-5 - --weight-decay 0.1 - --lr-warmup-iters 500 - --clip-grad 1.0 - --bf16 -) - -MODEL_PARALLEL_ARGS=( - --tensor-model-parallel-size 1 - --pipeline-model-parallel-size 4 - --expert-model-parallel-size 8 - --use-distributed-optimizer - --sequence-parallel -) - -LOGGING_ARGS=( - --log-interval 1 \ - --save-interval 10000 \ - --eval-interval 1000 \ - --eval-iters 10 \ - --save $CHECKPOINT_PATH \ - --load $CHECKPOINT_PATH \ - --tensorboard-dir "${CHECKPOINT_PATH}/tensorboard" \ - --no-load-optim \ - --no-load-rng -) - -if [ -n "${WANDB_API_KEY}" ]; then - LOGGING_ARGS+=( - --wandb-project ${WANDB_PROJECT:-"Mixtral"} - --wandb-exp-name ${WANDB_NAME:-"Mixtral_8x7B"} - ) -fi - - -torchrun ${DISTRIBUTED_ARGS[@]} pretrain_gpt.py \ - ${MODEL_ARGS[@]} \ - ${MOE_ARGS[@]} \ - ${DATA_ARGS[@]} \ - ${TRAINING_ARGS[@]} \ - ${MODEL_PARALLEL_ARGS[@]} \ - ${LOGGING_ARGS[@]} diff --git a/examples/gpt3/README.md b/examples/open_models/gpt3/README.md similarity index 91% rename from examples/gpt3/README.md rename to 
examples/open_models/gpt3/README.md index 8d6f2674163..4bf11cdb7e1 100644 --- a/examples/gpt3/README.md +++ b/examples/open_models/gpt3/README.md @@ -24,7 +24,7 @@ docker run \ -v /path/to/data:/path/to/data \ -v /path/to/megatron-lm:/workspace/megatron-lm \ megatron-lm nvcr.io/nvidia/pytorch:24.01-py3 \ - bash examples/gpt3/train_gpt3_175b_distributed.sh $CHECKPOINT_PATH $TENSORBOARD_LOGS_PATH $VOCAB_FILE $MERGE_FILE $DATA_PATH " + bash examples/open_models/gpt3/train_gpt3_175b_distributed.sh $CHECKPOINT_PATH $TENSORBOARD_LOGS_PATH $VOCAB_FILE $MERGE_FILE $DATA_PATH " ``` NOTE: Depending on the environment you are running in, the above command might look slightly different. diff --git a/examples/gpt3/gpt_config.yaml b/examples/open_models/gpt3/gpt_config.yaml similarity index 100% rename from examples/gpt3/gpt_config.yaml rename to examples/open_models/gpt3/gpt_config.yaml diff --git a/examples/gpt3/train_gpt3_175b_distributed.sh b/examples/open_models/gpt3/train_gpt3_175b_distributed.sh similarity index 100% rename from examples/gpt3/train_gpt3_175b_distributed.sh rename to examples/open_models/gpt3/train_gpt3_175b_distributed.sh diff --git a/examples/llama/README.md b/examples/open_models/llama/README.md similarity index 97% rename from examples/llama/README.md rename to examples/open_models/llama/README.md index 9872185ab2f..c4470cd4812 100644 --- a/examples/llama/README.md +++ b/examples/open_models/llama/README.md @@ -45,7 +45,7 @@ docker run --rm --gpus all --ipc=host --ulimit memlock=-1 \ -v "${HOST_TENSORBOARD_LOGS_PATH}:/workspace/tensorboard_logs" \ --workdir /workspace/megatron-lm \ $PYTORCH_IMAGE \ - bash examples/llama/train_llama3_8b_h100_fp8.sh \ + bash examples/open_models/llama/train_llama3_8b_h100_fp8.sh \ /workspace/checkpoints \ /workspace/tensorboard_logs \ 2>&1 | tee "${HOST_TENSORBOARD_LOGS_PATH}/training_mock_$(date +'%y-%m-%d_%H-%M-%S').log" @@ -63,7 +63,7 @@ docker run --rm --gpus all --ipc=host --ulimit memlock=-1 \ -v "$(dirname
"${HOST_DATA_PREFIX}"):/workspace/data_dir" \ --workdir /workspace/megatron-lm \ $PYTORCH_IMAGE \ - bash examples/llama/train_llama3_8b_h100_fp8.sh \ + bash examples/open_models/llama/train_llama3_8b_h100_fp8.sh \ /workspace/checkpoints \ /workspace/tensorboard_logs \ /workspace/tokenizer_model \ diff --git a/examples/llama/train_llama3_8b_h100_fp8.sh b/examples/open_models/llama/train_llama3_8b_h100_fp8.sh similarity index 100% rename from examples/llama/train_llama3_8b_h100_fp8.sh rename to examples/open_models/llama/train_llama3_8b_h100_fp8.sh diff --git a/examples/mamba/.gitignore b/examples/open_models/mamba/.gitignore similarity index 100% rename from examples/mamba/.gitignore rename to examples/open_models/mamba/.gitignore diff --git a/examples/mamba/Dockerfile b/examples/open_models/mamba/Dockerfile similarity index 100% rename from examples/mamba/Dockerfile rename to examples/open_models/mamba/Dockerfile diff --git a/examples/mamba/README.md b/examples/open_models/mamba/README.md similarity index 96% rename from examples/mamba/README.md rename to examples/open_models/mamba/README.md index f8f6d796837..f26a3de8dfe 100644 --- a/examples/mamba/README.md +++ b/examples/open_models/mamba/README.md @@ -22,7 +22,7 @@ docker run --gpus all -it --rm \ -v /path/to/megatron:/workspace/megatron \ -v /path/to/dataset:/workspace/dataset \ -v /path/to/checkpoints:/workspace/checkpoints \ - -w /workspace/megatron/examples/mamba \ + -w /workspace/megatron/examples/open_models/mamba \ your_image_name:your_tag ``` @@ -55,7 +55,7 @@ including the hybrid layer configuration and model parallel configuration. If you need to convert a hybrid checkpoint file to a different tensor parallel or pipeline parallel size, use -[the hybrid conversion script](../../tools/checkpoint/hybrid_conversion.py). +[the hybrid conversion script](../../../tools/checkpoint/hybrid_conversion.py). There is an example run command at the end of that file. 
Before running that script, you will need to set `PYTHONPATH` to include the diff --git a/examples/mamba/run_text_gen_server_8b.sh b/examples/open_models/mamba/run_text_gen_server_8b.sh similarity index 89% rename from examples/mamba/run_text_gen_server_8b.sh rename to examples/open_models/mamba/run_text_gen_server_8b.sh index 8d3137f2442..ef328fb0063 100755 --- a/examples/mamba/run_text_gen_server_8b.sh +++ b/examples/open_models/mamba/run_text_gen_server_8b.sh @@ -1,7 +1,7 @@ #!/bin/bash # Use: ./run_text_gen_server_8b.sh -# To launch the client: python ../../tools/text_generation_cli.py +# To launch the client: python ../../../tools/text_generation_cli.py CHECKPOINT_PATH=$1 TOKENIZER_PATH=$2 @@ -20,7 +20,7 @@ export NCCL_IB_QPS_PER_CONNECTION=4 export TRITON_CACHE_DIR="./triton-cache/" export TRITON_CACHE_MANAGER="megatron.core.ssm.triton_cache_manager:ParallelFileCacheManager" -torchrun $DISTRIBUTED_ARGS ../../tools/run_mamba_text_generation_server.py \ +torchrun $DISTRIBUTED_ARGS ../../../tools/run_mamba_text_generation_server.py \ --tensor-model-parallel-size 1 \ --pipeline-model-parallel-size 1 \ --untie-embeddings-and-output-weights \ diff --git a/examples/mamba/run_text_gen_server_8b_gpt3.sh b/examples/open_models/mamba/run_text_gen_server_8b_gpt3.sh similarity index 88% rename from examples/mamba/run_text_gen_server_8b_gpt3.sh rename to examples/open_models/mamba/run_text_gen_server_8b_gpt3.sh index 5413b245ed3..f427a76dba5 100644 --- a/examples/mamba/run_text_gen_server_8b_gpt3.sh +++ b/examples/open_models/mamba/run_text_gen_server_8b_gpt3.sh @@ -1,7 +1,7 @@ #!/bin/bash # Use: ./run_text_gen_server_8b_gpt3.sh -# To launch the client: python ../../tools/text_generation_cli.py +# To launch the client: python ../../../tools/text_generation_cli.py CHECKPOINT_PATH=$1 TOKENIZER_PATH=$2 @@ -17,7 +17,7 @@ export CUDA_DEVICE_MAX_CONNECTIONS=1 export NCCL_IB_TIMEOUT=19 export NCCL_IB_QPS_PER_CONNECTION=4 -torchrun $DISTRIBUTED_ARGS 
../../tools/run_text_generation_server.py \ +torchrun $DISTRIBUTED_ARGS ../../../tools/run_text_generation_server.py \ --tensor-model-parallel-size 1 \ --pipeline-model-parallel-size 1 \ --use-flash-attn \ diff --git a/examples/mamba/train.sh b/examples/open_models/mamba/train.sh similarity index 97% rename from examples/mamba/train.sh rename to examples/open_models/mamba/train.sh index 3952a997d47..b6f780d09aa 100755 --- a/examples/mamba/train.sh +++ b/examples/open_models/mamba/train.sh @@ -102,4 +102,4 @@ options=" \ --no-create-attention-mask-in-dataloader \ --tensorboard-dir ${TENSORBOARD_DIR}" -torchrun --nproc_per_node 8 ../../pretrain_mamba.py ${options} +torchrun --nproc_per_node 8 ../../../pretrain_mamba.py ${options} diff --git a/examples/mimo/__init__.py b/examples/open_models/mimo/__init__.py similarity index 100% rename from examples/mimo/__init__.py rename to examples/open_models/mimo/__init__.py diff --git a/examples/mimo/avlm_inference.py b/examples/open_models/mimo/avlm_inference.py similarity index 100% rename from examples/mimo/avlm_inference.py rename to examples/open_models/mimo/avlm_inference.py diff --git a/examples/mimo/configs/llava_avlm.py b/examples/open_models/mimo/configs/llava_avlm.py similarity index 100% rename from examples/mimo/configs/llava_avlm.py rename to examples/open_models/mimo/configs/llava_avlm.py diff --git a/examples/mimo/configs/llava_vlm.py b/examples/open_models/mimo/configs/llava_vlm.py similarity index 100% rename from examples/mimo/configs/llava_vlm.py rename to examples/open_models/mimo/configs/llava_vlm.py diff --git a/examples/mimo/configs/mock.py b/examples/open_models/mimo/configs/mock.py similarity index 100% rename from examples/mimo/configs/mock.py rename to examples/open_models/mimo/configs/mock.py diff --git a/examples/mimo/data/__init__.py b/examples/open_models/mimo/data/__init__.py similarity index 100% rename from examples/mimo/data/__init__.py rename to examples/open_models/mimo/data/__init__.py 
diff --git a/examples/mimo/data/avlm_sample_loader.py b/examples/open_models/mimo/data/avlm_sample_loader.py similarity index 100% rename from examples/mimo/data/avlm_sample_loader.py rename to examples/open_models/mimo/data/avlm_sample_loader.py diff --git a/examples/mimo/data/energon_avlm_task_encoder.py b/examples/open_models/mimo/data/energon_avlm_task_encoder.py similarity index 100% rename from examples/mimo/data/energon_avlm_task_encoder.py rename to examples/open_models/mimo/data/energon_avlm_task_encoder.py diff --git a/examples/mimo/data/energon_vlm_task_encoder.py b/examples/open_models/mimo/data/energon_vlm_task_encoder.py similarity index 100% rename from examples/mimo/data/energon_vlm_task_encoder.py rename to examples/open_models/mimo/data/energon_vlm_task_encoder.py diff --git a/examples/mimo/data/mock.py b/examples/open_models/mimo/data/mock.py similarity index 100% rename from examples/mimo/data/mock.py rename to examples/open_models/mimo/data/mock.py diff --git a/examples/mimo/data/prepare_video_llava_data.py b/examples/open_models/mimo/data/prepare_video_llava_data.py similarity index 100% rename from examples/mimo/data/prepare_video_llava_data.py rename to examples/open_models/mimo/data/prepare_video_llava_data.py diff --git a/examples/mimo/data/utils/calculate_audio_tokens.py b/examples/open_models/mimo/data/utils/calculate_audio_tokens.py similarity index 100% rename from examples/mimo/data/utils/calculate_audio_tokens.py rename to examples/open_models/mimo/data/utils/calculate_audio_tokens.py diff --git a/examples/mimo/model_providers/__init__.py b/examples/open_models/mimo/model_providers/__init__.py similarity index 100% rename from examples/mimo/model_providers/__init__.py rename to examples/open_models/mimo/model_providers/__init__.py diff --git a/examples/mimo/model_providers/hf_clip_encoder.py b/examples/open_models/mimo/model_providers/hf_clip_encoder.py similarity index 100% rename from 
examples/mimo/model_providers/hf_clip_encoder.py rename to examples/open_models/mimo/model_providers/hf_clip_encoder.py diff --git a/examples/mimo/model_providers/hf_whisper_encoder.py b/examples/open_models/mimo/model_providers/hf_whisper_encoder.py similarity index 100% rename from examples/mimo/model_providers/hf_whisper_encoder.py rename to examples/open_models/mimo/model_providers/hf_whisper_encoder.py diff --git a/examples/mimo/model_providers/llava_avlm.py b/examples/open_models/mimo/model_providers/llava_avlm.py similarity index 100% rename from examples/mimo/model_providers/llava_avlm.py rename to examples/open_models/mimo/model_providers/llava_avlm.py diff --git a/examples/mimo/model_providers/llava_vlm.py b/examples/open_models/mimo/model_providers/llava_vlm.py similarity index 100% rename from examples/mimo/model_providers/llava_vlm.py rename to examples/open_models/mimo/model_providers/llava_vlm.py diff --git a/examples/mimo/model_providers/mock.py b/examples/open_models/mimo/model_providers/mock.py similarity index 100% rename from examples/mimo/model_providers/mock.py rename to examples/open_models/mimo/model_providers/mock.py diff --git a/examples/mimo/scripts/run_avlm_train.sh b/examples/open_models/mimo/scripts/run_avlm_train.sh similarity index 94% rename from examples/mimo/scripts/run_avlm_train.sh rename to examples/open_models/mimo/scripts/run_avlm_train.sh index ced70cd0047..6b5cfbb359f 100755 --- a/examples/mimo/scripts/run_avlm_train.sh +++ b/examples/open_models/mimo/scripts/run_avlm_train.sh @@ -116,7 +116,7 @@ AUDIO_MODEL_ARGS=( if [ "$DEBUG_MODE" = true ]; then echo "Running in debug mode with $GPUS_PER_NODE GPU(s) per node..." 
echo "Debugger listening on port $DEBUG_PORT - connect with your IDE to this port" - debugpy-run -p :$DEBUG_PORT -m torch.distributed.run -- ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \ + debugpy-run -p :$DEBUG_PORT -m torch.distributed.run -- ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \ ${TRAINING_ARGS[@]} \ ${MODEL_PARALLEL_ARGS[@]} \ ${EVAL_AND_LOGGING_ARGS[@]} \ @@ -128,7 +128,7 @@ else echo "Running in normal mode with $GPUS_PER_NODE GPU(s) per node..." if [ "$DRY_RUN" = true ]; then echo "Dry run mode enabled" - echo "torchrun ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \ + echo "torchrun ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \ ${TRAINING_ARGS[@]} \ ${MODEL_PARALLEL_ARGS[@]} \ ${EVAL_AND_LOGGING_ARGS[@]} \ @@ -137,7 +137,7 @@ else ${AUDIO_MODEL_ARGS[@]} \ ${DATASET_ARGS[@]}" else - torchrun ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \ + torchrun ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \ ${TRAINING_ARGS[@]} \ ${MODEL_PARALLEL_ARGS[@]} \ ${EVAL_AND_LOGGING_ARGS[@]} \ diff --git a/examples/mimo/scripts/run_mock_train.sh b/examples/open_models/mimo/scripts/run_mock_train.sh similarity index 91% rename from examples/mimo/scripts/run_mock_train.sh rename to examples/open_models/mimo/scripts/run_mock_train.sh index 2ed71cd5ede..5970fe60767 100755 --- a/examples/mimo/scripts/run_mock_train.sh +++ b/examples/open_models/mimo/scripts/run_mock_train.sh @@ -1,7 +1,7 @@ #!/bin/bash # from the root of the repo -# ./examples/mimo/scripts/run_mock_train.sh +# ./examples/open_models/mimo/scripts/run_mock_train.sh export CUDA_DEVICE_MAX_CONNECTIONS=1 DRY_RUN=false @@ -82,7 +82,7 @@ GPT_MODEL_ARGS=( if [ "$DEBUG_MODE" = true ]; then echo "Running in debug mode with $GPUS_PER_NODE GPU(s) per node..." 
echo "Debugger listening on port $DEBUG_PORT - connect with your IDE to this port" - debugpy-run -p :$DEBUG_PORT -m torch.distributed.run -- ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \ + debugpy-run -p :$DEBUG_PORT -m torch.distributed.run -- ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \ ${TRAINING_ARGS[@]} \ ${MODEL_PARALLEL_ARGS[@]} \ ${EVAL_AND_LOGGING_ARGS[@]} \ @@ -92,14 +92,14 @@ else echo "Running in normal mode with $GPUS_PER_NODE GPU(s) per node..." if [ "$DRY_RUN" = true ]; then echo "Dry run mode enabled" - echo "torchrun ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \ + echo "torchrun ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \ ${TRAINING_ARGS[@]} \ ${MODEL_PARALLEL_ARGS[@]} \ ${EVAL_AND_LOGGING_ARGS[@]} \ ${TOKENIZER_ARGS[@]} \ ${GPT_MODEL_ARGS[@]}" else - torchrun ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \ + torchrun ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \ ${TRAINING_ARGS[@]} \ ${MODEL_PARALLEL_ARGS[@]} \ ${EVAL_AND_LOGGING_ARGS[@]} \ diff --git a/examples/mimo/scripts/run_video_vlm_train.sh b/examples/open_models/mimo/scripts/run_video_vlm_train.sh similarity index 94% rename from examples/mimo/scripts/run_video_vlm_train.sh rename to examples/open_models/mimo/scripts/run_video_vlm_train.sh index 2ec8af9d55f..ff63c1fc495 100755 --- a/examples/mimo/scripts/run_video_vlm_train.sh +++ b/examples/open_models/mimo/scripts/run_video_vlm_train.sh @@ -110,7 +110,7 @@ GPT_MODEL_ARGS=( if [ "$DEBUG_MODE" = true ]; then echo "Running in debug mode with $GPUS_PER_NODE GPU(s) per node..." 
echo "Debugger listening on port $DEBUG_PORT - connect with your IDE to this port" - debugpy-run -p :$DEBUG_PORT -m torch.distributed.run -- ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \ + debugpy-run -p :$DEBUG_PORT -m torch.distributed.run -- ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \ ${TRAINING_ARGS[@]} \ ${MODEL_PARALLEL_ARGS[@]} \ ${EVAL_AND_LOGGING_ARGS[@]} \ @@ -121,7 +121,7 @@ else echo "Running in normal mode with $GPUS_PER_NODE GPU(s) per node..." if [ "$DRY_RUN" = true ]; then echo "Dry run mode enabled" - echo "torchrun ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \ + echo "torchrun ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \ ${TRAINING_ARGS[@]} \ ${MODEL_PARALLEL_ARGS[@]} \ ${EVAL_AND_LOGGING_ARGS[@]} \ @@ -129,7 +129,7 @@ else ${GPT_MODEL_ARGS[@]} \ ${DATASET_ARGS[@]}" else - torchrun ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \ + torchrun ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \ ${TRAINING_ARGS[@]} \ ${MODEL_PARALLEL_ARGS[@]} \ ${EVAL_AND_LOGGING_ARGS[@]} \ diff --git a/examples/mimo/scripts/run_vlm_train.sh b/examples/open_models/mimo/scripts/run_vlm_train.sh similarity index 94% rename from examples/mimo/scripts/run_vlm_train.sh rename to examples/open_models/mimo/scripts/run_vlm_train.sh index 702b29795a4..bc95f9cc6c1 100755 --- a/examples/mimo/scripts/run_vlm_train.sh +++ b/examples/open_models/mimo/scripts/run_vlm_train.sh @@ -111,7 +111,7 @@ GPT_MODEL_ARGS=( if [ "$DEBUG_MODE" = true ]; then echo "Running in debug mode with $GPUS_PER_NODE GPU(s) per node..." 
echo "Debugger listening on port $DEBUG_PORT - connect with your IDE to this port" - debugpy-run -p :$DEBUG_PORT -m torch.distributed.run -- ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \ + debugpy-run -p :$DEBUG_PORT -m torch.distributed.run -- ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \ ${TRAINING_ARGS[@]} \ ${MODEL_PARALLEL_ARGS[@]} \ ${EVAL_AND_LOGGING_ARGS[@]} \ @@ -122,7 +122,7 @@ else echo "Running in normal mode with $GPUS_PER_NODE GPU(s) per node..." if [ "$DRY_RUN" = true ]; then echo "Dry run mode enabled" - echo "torchrun ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \ + echo "torchrun ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \ ${TRAINING_ARGS[@]} \ ${MODEL_PARALLEL_ARGS[@]} \ ${EVAL_AND_LOGGING_ARGS[@]} \ @@ -130,7 +130,7 @@ else ${GPT_MODEL_ARGS[@]} \ ${DATASET_ARGS[@]}" else - torchrun ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \ + torchrun ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \ ${TRAINING_ARGS[@]} \ ${MODEL_PARALLEL_ARGS[@]} \ ${EVAL_AND_LOGGING_ARGS[@]} \ diff --git a/examples/mimo/train.py b/examples/open_models/mimo/train.py similarity index 100% rename from examples/mimo/train.py rename to examples/open_models/mimo/train.py diff --git a/examples/mimo/utils/__init__.py b/examples/open_models/mimo/utils/__init__.py similarity index 100% rename from examples/mimo/utils/__init__.py rename to examples/open_models/mimo/utils/__init__.py diff --git a/examples/mimo/utils/data_helpers.py b/examples/open_models/mimo/utils/data_helpers.py similarity index 100% rename from examples/mimo/utils/data_helpers.py rename to examples/open_models/mimo/utils/data_helpers.py diff --git a/examples/mimo/utils/logging.py b/examples/open_models/mimo/utils/logging.py similarity index 100% rename from examples/mimo/utils/logging.py rename to examples/open_models/mimo/utils/logging.py diff --git a/examples/mimo/utils/model_helpers.py b/examples/open_models/mimo/utils/model_helpers.py similarity index 100% rename from 
examples/mimo/utils/model_helpers.py rename to examples/open_models/mimo/utils/model_helpers.py diff --git a/examples/t5/README.md b/examples/open_models/t5/README.md similarity index 93% rename from examples/t5/README.md rename to examples/open_models/t5/README.md index 205da1db370..8fcf7b96c59 100644 --- a/examples/t5/README.md +++ b/examples/open_models/t5/README.md @@ -21,7 +21,7 @@ DATA_PATH="" #_text_document srun -N $NUM_NODES --container-image $PYTORCH_IMAGE --container-mounts "/path/to/data:/path/to/data,/path/to/megatron-lm:/workspace/megatron-lm" --account $ACCOUNT -N 1 -J $JOB_NAME -p $PARTITION --no-container-mount-home -c " cd /workspace/megatron-lm - ./examples/t5/train_t5_220m_distributed.sh $CHECKPOINT_PATH $TENSORBOARD_LOGS_PATH $VOCAB_FILE $DATA_PATH" + ./examples/open_models/t5/train_t5_220m_distributed.sh $CHECKPOINT_PATH $TENSORBOARD_LOGS_PATH $VOCAB_FILE $DATA_PATH" ``` diff --git a/examples/t5/t5_mcore_train_curve.png b/examples/open_models/t5/t5_mcore_train_curve.png similarity index 100% rename from examples/t5/t5_mcore_train_curve.png rename to examples/open_models/t5/t5_mcore_train_curve.png diff --git a/examples/t5/train_t5_220m_distributed.sh b/examples/open_models/t5/train_t5_220m_distributed.sh similarity index 100% rename from examples/t5/train_t5_220m_distributed.sh rename to examples/open_models/t5/train_t5_220m_distributed.sh diff --git a/megatron/core/README.md b/megatron/core/README.md index a9134be41cd..4fae09d9909 100644 --- a/megatron/core/README.md +++ b/megatron/core/README.md @@ -40,7 +40,7 @@ torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py - **[Simple Training Loop](https://github.com/NVIDIA/Megatron-LM/blob/main/examples/run_simple_mcore_train_loop.py)** - Basic usage - **[Multimodal Training](https://github.com/NVIDIA/Megatron-LM/blob/main/examples/multimodal/)** - Vision-language models - **[Mixture-of-Experts](https://github.com/yanring/Megatron-MoE-ModelZoo)** - MoE examples -- **[Mamba 
Models](https://github.com/NVIDIA/Megatron-LM/blob/main/examples/mamba/)** - State-space models +- **[Mamba Models](https://github.com/NVIDIA/Megatron-LM/blob/main/examples/open_models/mamba/)** - State-space models **Documentation:** - **[📚 API Guide](https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/index.html)** - Complete API documentation diff --git a/tests/test_utils/recipes/h100/mimo.yaml b/tests/test_utils/recipes/h100/mimo.yaml index 88b17815ede..5614f690682 100644 --- a/tests/test_utils/recipes/h100/mimo.yaml +++ b/tests/test_utils/recipes/h100/mimo.yaml @@ -50,7 +50,7 @@ spec: "TENSORBOARD_PATH={assets_dir}/tensorboard" "CHECKPOINT_SAVE_PATH={artifacts_dir}/checkpoints" "CHECKPOINT_LOAD_PATH=/mnt/artifacts" - "TRAINING_SCRIPT_PATH=./examples/mimo/train.py" + "TRAINING_SCRIPT_PATH=./examples/open_models/mimo/train.py" "TRAINING_PARAMS_PATH=./tests/functional_tests/test_cases/{model}/{test_case}/model_config.yaml" "GOLDEN_VALUES_PATH=./tests/functional_tests/test_cases/{model}/{test_case}/golden_values_{environment}_{platforms}.json" "N_REPEAT={n_repeat}"