diff --git a/docs/get-started/quickstart.md b/docs/get-started/quickstart.md
index 61868e7877c..28514e3b2a3 100644
--- a/docs/get-started/quickstart.md
+++ b/docs/get-started/quickstart.md
@@ -33,7 +33,7 @@ torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py
```bash
# 8 GPUs, FP8 precision, mock data
-./examples/llama/train_llama3_8b_fp8.sh
+./examples/open_models/llama/train_llama3_8b_h100_fp8.sh
```
## Data Preparation
diff --git a/docs/models/llms.md b/docs/models/llms.md
index 6789a4c551c..3402c522713 100644
--- a/docs/models/llms.md
+++ b/docs/models/llms.md
@@ -34,12 +34,10 @@ See the [Megatron Bridge supported models list](https://github.com/NVIDIA-NeMo/M
## Example Scripts
Training examples for these models can be found in the `examples/` directory:
-- `examples/gpt3/` - GPT-3 training scripts
-- `examples/llama/` - LLaMA training scripts
-- `examples/mixtral/` - Mixtral MoE training
-- `examples/mamba/` - Mamba training scripts
-- `examples/bert/` - BERT training scripts
-- `examples/t5/` - T5 training scripts
+- `examples/open_models/gpt3/` - GPT-3 training scripts
+- `examples/open_models/llama/` - LLaMA training scripts
+- `examples/open_models/mamba/` - Mamba training scripts
+- `examples/open_models/t5/` - T5 training scripts
## Model Implementation
diff --git a/docs/models/multimodal.md b/docs/models/multimodal.md
index 66ed8ccd9cb..11475495894 100644
--- a/docs/models/multimodal.md
+++ b/docs/models/multimodal.md
@@ -14,7 +14,7 @@ Megatron Core supports multimodal models that combine language with vision, audi
- Unified embedding space across modalities
- Support for both vision-language and audio-vision-language models
-See [examples/mimo](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/mimo) for training scripts and examples.
+See [examples/open_models/mimo](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/open_models/mimo) for training scripts and examples.
## Vision-Language Models
@@ -52,7 +52,7 @@ For multimodal diffusion models (image generation, text-to-image, etc.), see [Ne
Multimodal training examples can be found in the following directories:
**MIMO Framework:**
-- `examples/mimo/` - Multimodal In/Out training with support for vision-language and audio-vision-language models
+- `examples/open_models/mimo/` - Multimodal In/Out training with support for vision-language and audio-vision-language models
**Specific Multimodal Models:**
- `examples/multimodal/` - LLaVA-style training with Mistral + CLIP
diff --git a/docs/user-guide/training-examples.md b/docs/user-guide/training-examples.md
index 2824c608c36..2295fd16b40 100644
--- a/docs/user-guide/training-examples.md
+++ b/docs/user-guide/training-examples.md
@@ -24,7 +24,7 @@ This example:
Train LLaMA-3 8B model with FP8 mixed precision on 8 GPUs:
```bash
-./examples/llama/train_llama3_8b_fp8.sh
+./examples/open_models/llama/train_llama3_8b_h100_fp8.sh
```
**Configuration:**
diff --git a/examples/bert/README.md b/examples/bert/README.md
deleted file mode 100644
index 6c1fe95bf06..00000000000
--- a/examples/bert/README.md
+++ /dev/null
@@ -1,53 +0,0 @@
-# BERT MODEL
-
-## Table of contents
-- [1. Training Setup](#1-training-setup)
-- [2. Configurations](#2-configurations)
-
-## 1. Training setup
-
-
-To run the model using a docker container run it as follows
-```
-PYTORCH_IMAGE=nvcr.io/nvidia/pytorch:24.01-py3
-CHECKPOINT_PATH="" #
-TENSORBOARD_LOGS_PATH=""#
-VOCAB_FILE="" #//bert-vocab.txt
-DATA_PATH="" #_text_document
-
-docker run \
- --gpus=all \
- --ipc=host \
- --workdir /workspace/megatron-lm \
- -v /path/to/data:/path/to/data \
- -v /path/to/megatron-lm:/workspace/megatron-lm \
- megatron-lm nvcr.io/nvidia/pytorch:24.01-py3 \
- bash examples/bert/train_bert_340m_distributed.sh $CHECKPOINT_PATH $TENSORBOARD_LOGS_PATH $VOCAB_FILE $DATA_PATH "
-
-```
-NOTE: Depending on the environment you are running it the above command might like slightly different.
-
-
-## 2. Configurations
-
-The example in this folder shows you how to run 340m large model. There are other configs you could run as well
-
-### 4B
-```
- --num-layers 48 \
- --hidden-size 2560 \
- --num-attention-heads 32 \
- --tensor-model-parallel-size 1 \
- --pipeline-model-parallel-size 1 \
-
-```
-
-### 20B
-```
- --num-layers 48 \
- --hidden-size 6144 \
- --num-attention-heads 96 \
- --tensor-model-parallel-size 4 \
- --pipeline-model-parallel-size 4 \
-
-```
\ No newline at end of file
diff --git a/examples/bert/train_bert_340m_distributed.sh b/examples/bert/train_bert_340m_distributed.sh
deleted file mode 100644
index f0d9c87c8bf..00000000000
--- a/examples/bert/train_bert_340m_distributed.sh
+++ /dev/null
@@ -1,79 +0,0 @@
-#!/bin/bash
-
-# Runs the "340M" parameter model (Bert - Large)
-
-export CUDA_DEVICE_MAX_CONNECTIONS=1
-
-GPUS_PER_NODE=8
-# Change for multinode config
-MASTER_ADDR=localhost
-MASTER_PORT=6000
-NUM_NODES=1
-NODE_RANK=0
-WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES))
-
-CHECKPOINT_PATH=$1 #
-TENSORBOARD_LOGS_PATH=$2 #
-VOCAB_FILE=$3 #/bert-vocab.json
-DATA_PATH=$4 #_text_document
-
-DISTRIBUTED_ARGS=(
- --nproc_per_node $GPUS_PER_NODE
- --nnodes $NUM_NODES
- --master_addr $MASTER_ADDR
- --master_port $MASTER_PORT
-)
-
-BERT_MODEL_ARGS=(
- --num-layers 24
- --hidden-size 1024
- --num-attention-heads 16
- --seq-length 512
- --max-position-embeddings 512
- --attention-backend auto # Can use (flash/fused/unfused/local)
-)
-
-TRAINING_ARGS=(
- --micro-batch-size 4
- --global-batch-size 32
- --train-iters 1000000
- --weight-decay 1e-2
- --clip-grad 1.0
- --fp16
- --lr 0.0001
- --lr-decay-iters 990000
- --lr-decay-style linear
- --min-lr 1.0e-5
- --weight-decay 1e-2
- --lr-warmup-fraction .01
- --clip-grad 1.0
-)
-
-MODEL_PARALLEL_ARGS=(
- --tensor-model-parallel-size 8
- --pipeline-model-parallel-size 16
-)
-
-DATA_ARGS=(
- --data-path $DATA_PATH
- --vocab-file $VOCAB_FILE
- --split 949,50,1
-)
-
-EVAL_AND_LOGGING_ARGS=(
- --log-interval 100
- --save-interval 10000
- --eval-interval 1000
- --save $CHECKPOINT_PATH
- --load $CHECKPOINT_PATH
- --eval-iters 10
- --tensorboard-dir $TENSORBOARD_LOGS_PATH
-)
-
-torchrun ${DISTRIBUTED_ARGS[@]} pretrain_bert.py \
- ${BERT_MODEL_ARGS[@]} \
- ${TRAINING_ARGS[@]} \
- ${MODEL_PARALLEL_ARGS[@]} \
- ${DATA_ARGS[@]} \
- ${EVAL_AND_LOGGING_ARGS[@]}
-
\ No newline at end of file
diff --git a/examples/mixtral/README.md b/examples/mixtral/README.md
deleted file mode 100644
index e85eccd6efd..00000000000
--- a/examples/mixtral/README.md
+++ /dev/null
@@ -1,132 +0,0 @@
-# Mixtral 8x7B Model Inference and Finetuning
-
-## Download Mixtral 8x7B Checkpoints
-Download Mixtral 8x7B HF format checkpoint from [HF-hub](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/)
-
-Or you can simply run this following script to download Mixtral 8x7B into a specific folder.
-```python
-from huggingface_hub import snapshot_download
-SAVED_DIR = "" # Specify the saved directory
-# Download HF checkpoints
-snapshot_download(repo_id="mistralai/Mixtral-8x7B-v0.1", ignore_patterns=["*.pt"], local_dir=SAVED_DIR, local_dir_use_symlinks=False)
-```
-
-## Convert Mixtral 8x7B checkpoints from HF to MCore
-The HF checkpoints can be converted to Megatron format by using the provided checkpoint converter for HF format.
-The target model parallel size(e.g. TP,PP,EP) should be specified.
-
-Currently the converter doesn't support distributed checkpointing yet, so each different parallel config requires a specific checkpoint.
-- For training, the recommended model parallel config is TP1EP8PP4
-- For inference, the recommended model parallel config is TP1EP1PP2
-
-```
-TOKENIZER_MODEL=/workspace/checkpoints/mixtral-hf/tokenizer.model
-MEGATRON_PATH="/workspace/megatron-lm"
-export PYTHONPATH=$MEGATRON_PATH:$PYTHONPATH
-export CUDA_DEVICE_MAX_CONNECTIONS=1
-
-TARGET_TP_SIZE=""
-TARGET_EP_SIZE=""
-TARGET_PP_SIZE=""
-
-HF_FORMAT_DIR=/workspace/checkpoints/mixtral-hf
-MEGATRON_FORMAT_DIR=/workspace/checkpoints/mixtral-mcore-TP${TARGET_TP_SIZE}PP${TARGET_PP_SIZE}EP${TARGET_EP_SIZE}
-
-python tools/checkpoint/convert.py \
---model-type GPT \
---loader loader_mixtral_hf \
---saver mcore \
---target-tensor-parallel-size ${TARGET_TP_SIZE} \
---target-pipeline-parallel-size ${TARGET_PP_SIZE} \
---target-expert-parallel-size ${TARGET_EP_SIZE} \
---load-dir ${HF_FORMAT_DIR} \
---save-dir ${MEGATRON_FORMAT_DIR} \
---tokenizer-model ${TOKENIZER_MODEL}
-```
-
-## Text generation with Mixtral 8x7B
-Inference with Mixtral 8x7B requires at least 2 GPUS, such that a distributed checkpoint with EP>=2 or PP>=2 converted with above script is needed.
-
-The Megatron-LM have included a simple REST server to use for text generation in `tools/run_text_generation_server.py`, launch it with the following script:
-```
-#!/bin/bash
-# This example will start serving the Mixtral 8x7B model.
-DISTRIBUTED_ARGS="--nproc_per_node 2 \
- --nnodes 1 \
- --node_rank 0 \
- --master_addr localhost \
- --master_port 6000"
-
-CHECKPOINT=
-TOKENIZER_MODEL=
-
-export CUDA_DEVICE_MAX_CONNECTIONS=1
-
-pip install flask-restful
-
-torchrun $DISTRIBUTED_ARGS tools/run_text_generation_server.py \
- --tensor-model-parallel-size 1 \
- --pipeline-model-parallel-size 2 \
- --expert-model-parallel-size 1 \
- --load ${CHECKPOINT} \
- --tokenizer-type Llama2Tokenizer \
- --tokenizer-model $TOKENIZER_MODEL \
- --use-mcore-models \
- --max-position-embeddings 32768 \
- --num-layers 32 \
- --hidden-size 4096 \
- --ffn-hidden-size 14336 \
- --num-attention-heads 32 \
- --normalization RMSNorm \
- --disable-bias-linear \
- --position-embedding-type rope \
- --no-position-embedding \
- --swiglu \
- --untie-embeddings-and-output-weights \
- --group-query-attention \
- --num-query-groups 8 \
- --bf16 \
- --micro-batch-size 1 \
- --seq-length 1024 \
- --seed 42 \
- --num-experts 8 \
- --moe-router-topk 2 \
- --moe-token-dispatcher-type alltoall \
- --moe-grouped-gemm \
- --mock-data \
- --rotary-base 1000000
-```
-
-Once the server is running you can use `tools/text_generation_cli.py` to query it, it takes one argument which is the host the server is running on.
-
-```
-python tools/text_generation_cli.py localhost:5000
-```
-
-
-## Finetuning from pretrained Mixtral 8x7B
-To finetuning pretrained Mixtral 8x7B, use the following scripts:
-
-
-```bash
-PYTORCH_IMAGE=nvcr.io/nvidia/pytorch:24.04-py3
-CHECKPOINT_PATH="" # Speicfy path to checkpoint dir
-TOKENIZER_MODEL="" # Specify path to tokenizer.model
-DATA_PATH="" # Specify path to data
-
-docker run \
- --gpus=all \
- --ipc=host \
- --workdir /workspace/megatron-lm \
- -v /path/to/data:/path/to/data \
- -v /path/to/megatron-lm:/workspace/megatron-lm \
- $PYTORCH_IMAGE \
- bash examples/mixtral/train_mixtral_8x7b_distributed.sh $CHECKPOINT_PATH $TOKENIZER_MODEL $DATA_PATH
-```
-
-The above functionality also applys to Mixtral 8x22B actually, you should set the model config (including hidden_size/head_num/num_layers/ffn_hidden_size) properly according to the original [config](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1/blob/main/config.json).
-
-## Acknowledgements
-Contributors outside NVIDIA for the huggingface converter and example of Mixtral models in Megatron-Core:
-- Peng Li
-- Jun Huang
diff --git a/examples/mixtral/train_mixtral_8x7b_distributed.sh b/examples/mixtral/train_mixtral_8x7b_distributed.sh
deleted file mode 100644
index ed44d60f5c0..00000000000
--- a/examples/mixtral/train_mixtral_8x7b_distributed.sh
+++ /dev/null
@@ -1,116 +0,0 @@
-#!/bin/bash
-
-# Runs Mixtral 8x7B model
-
-export CUDA_DEVICE_MAX_CONNECTIONS=1
-
-GPUS_PER_NODE=8
-# Change for multinode config
-MASTER_ADDR=${MASTER_ADDR:-"localhost"}
-MASTER_PORT=${MASTER_PORT:-"6000"}
-NNODES=${SLURM_NNODES:-"1"}
-NODE_RANK=${RANK:-"0"}
-WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
-
-CHECKPOINT_PATH=$1
-TOKENIZER_MODEL=$2
-DATA_PATH=$3
-
-DISTRIBUTED_ARGS=(
- --nproc_per_node $GPUS_PER_NODE
- --nnodes $NNODES
- --node_rank $NODE_RANK
- --master_addr $MASTER_ADDR
- --master_port $MASTER_PORT
-)
-
-MODEL_ARGS=(
- --use-mcore-models
- --disable-bias-linear
- --seq-length 4096
- --max-position-embeddings 32768
- --num-layers 32
- --hidden-size 4096
- --ffn-hidden-size 14336
- --num-attention-heads 32
- --init-method-std 0.01
- --attention-dropout 0.0
- --hidden-dropout 0.0
- --normalization RMSNorm
- --position-embedding-type rope
- --swiglu
- --untie-embeddings-and-output-weights
- --group-query-attention
- --num-query-groups 8
- --no-masked-softmax-fusion
- --no-position-embedding
- --rotary-base 1000000
-)
-
-MOE_ARGS=(
- --num-experts 8
- --moe-router-topk 2
- --moe-router-load-balancing-type aux_loss
- --moe-aux-loss-coeff 1e-2
- --moe-grouped-gemm
- --moe-token-dispatcher-type alltoall
- --overlap-param-gather
- --overlap-grad-reduce
-)
-
-DATA_ARGS=(
- --tokenizer-type Llama2Tokenizer
- --tokenizer-model ${TOKENIZER_MODEL}
- --data-path $DATA_PATH
- --split 99990,8,2
-)
-
-TRAINING_ARGS=(
- --micro-batch-size 1
- --global-batch-size 256
- --lr 1e-4
- --train-iters 500000
- --lr-decay-iters 320000
- --lr-decay-style cosine
- --min-lr 1.0e-5
- --weight-decay 0.1
- --lr-warmup-iters 500
- --clip-grad 1.0
- --bf16
-)
-
-MODEL_PARALLEL_ARGS=(
- --tensor-model-parallel-size 1
- --pipeline-model-parallel-size 4
- --expert-model-parallel-size 8
- --use-distributed-optimizer
- --sequence-parallel
-)
-
-LOGGING_ARGS=(
- --log-interval 1 \
- --save-interval 10000 \
- --eval-interval 1000 \
- --eval-iters 10 \
- --save $CHECKPOINT_PATH \
- --load $CHECKPOINT_PATH \
- --tensorboard-dir "${CHECKPOINT_PATH}/tensorboard" \
- --no-load-optim \
- --no-load-rng
-)
-
-if [ -n "${WANDB_API_KEY}" ]; then
- LOGGING_ARGS+=(
- --wandb-project ${WANDB_PROJECT:-"Mixtral"}
- --wandb-exp-name ${WANDB_NAME:-"Mixtral_8x7B"}
- )
-fi
-
-
-torchrun ${DISTRIBUTED_ARGS[@]} pretrain_gpt.py \
- ${MODEL_ARGS[@]} \
- ${MOE_ARGS[@]} \
- ${DATA_ARGS[@]} \
- ${TRAINING_ARGS[@]} \
- ${MODEL_PARALLEL_ARGS[@]} \
- ${LOGGING_ARGS[@]}
diff --git a/examples/gpt3/README.md b/examples/open_models/gpt3/README.md
similarity index 91%
rename from examples/gpt3/README.md
rename to examples/open_models/gpt3/README.md
index 8d6f2674163..4bf11cdb7e1 100644
--- a/examples/gpt3/README.md
+++ b/examples/open_models/gpt3/README.md
@@ -24,7 +24,7 @@ docker run \
-v /path/to/data:/path/to/data \
-v /path/to/megatron-lm:/workspace/megatron-lm \
megatron-lm nvcr.io/nvidia/pytorch:24.01-py3 \
- bash examples/gpt3/train_gpt3_175b_distributed.sh $CHECKPOINT_PATH $TENSORBOARD_LOGS_PATH $VOCAB_FILE $MERGE_FILE $DATA_PATH "
+ bash examples/open_models/gpt3/train_gpt3_175b_distributed.sh $CHECKPOINT_PATH $TENSORBOARD_LOGS_PATH $VOCAB_FILE $MERGE_FILE $DATA_PATH "
```
NOTE: Depending on the environment you are running it, the above command might look slightly different.
diff --git a/examples/gpt3/gpt_config.yaml b/examples/open_models/gpt3/gpt_config.yaml
similarity index 100%
rename from examples/gpt3/gpt_config.yaml
rename to examples/open_models/gpt3/gpt_config.yaml
diff --git a/examples/gpt3/train_gpt3_175b_distributed.sh b/examples/open_models/gpt3/train_gpt3_175b_distributed.sh
similarity index 100%
rename from examples/gpt3/train_gpt3_175b_distributed.sh
rename to examples/open_models/gpt3/train_gpt3_175b_distributed.sh
diff --git a/examples/llama/README.md b/examples/open_models/llama/README.md
similarity index 97%
rename from examples/llama/README.md
rename to examples/open_models/llama/README.md
index 9872185ab2f..c4470cd4812 100644
--- a/examples/llama/README.md
+++ b/examples/open_models/llama/README.md
@@ -45,7 +45,7 @@ docker run --rm --gpus all --ipc=host --ulimit memlock=-1 \
-v "${HOST_TENSORBOARD_LOGS_PATH}:/workspace/tensorboard_logs" \
--workdir /workspace/megatron-lm \
$PYTORCH_IMAGE \
- bash examples/llama/train_llama3_8b_h100_fp8.sh \
+ bash examples/open_models/llama/train_llama3_8b_h100_fp8.sh \
/workspace/checkpoints \
/workspace/tensorboard_logs \
2>&1 | tee "${HOST_TENSORBOARD_LOGS_PATH}/training_mock_$(date +'%y-%m-%d_%H-%M-%S').log"
@@ -63,7 +63,7 @@ docker run --rm --gpus all --ipc=host --ulimit memlock=-1 \
-v "$(dirname "${HOST_DATA_PREFIX}"):/workspace/data_dir" \
--workdir /workspace/megatron-lm \
$PYTORCH_IMAGE \
- bash examples/llama/train_llama3_8b_h100_fp8.sh \
+ bash examples/open_models/llama/train_llama3_8b_h100_fp8.sh \
/workspace/checkpoints \
/workspace/tensorboard_logs \
/workspace/tokenizer_model \
diff --git a/examples/llama/train_llama3_8b_h100_fp8.sh b/examples/open_models/llama/train_llama3_8b_h100_fp8.sh
similarity index 100%
rename from examples/llama/train_llama3_8b_h100_fp8.sh
rename to examples/open_models/llama/train_llama3_8b_h100_fp8.sh
diff --git a/examples/mamba/.gitignore b/examples/open_models/mamba/.gitignore
similarity index 100%
rename from examples/mamba/.gitignore
rename to examples/open_models/mamba/.gitignore
diff --git a/examples/mamba/Dockerfile b/examples/open_models/mamba/Dockerfile
similarity index 100%
rename from examples/mamba/Dockerfile
rename to examples/open_models/mamba/Dockerfile
diff --git a/examples/mamba/README.md b/examples/open_models/mamba/README.md
similarity index 96%
rename from examples/mamba/README.md
rename to examples/open_models/mamba/README.md
index f8f6d796837..f26a3de8dfe 100644
--- a/examples/mamba/README.md
+++ b/examples/open_models/mamba/README.md
@@ -22,7 +22,7 @@ docker run --gpus all -it --rm \
-v /path/to/megatron:/workspace/megatron \
-v /path/to/dataset:/workspace/dataset \
-v /path/to/checkpoints:/workspace/checkpoints \
- -w /workspace/megatron/examples/mamba \
+ -w /workspace/megatron/examples/open_models/mamba \
your_image_name:your_tag
```
@@ -55,7 +55,7 @@ including the hybrid layer configuration and model parallel configuration.
If you need to convert a hybrid checkpoint file to a different tensor parallel
or pipeline parallel size, use
-[the hybrid conversion script](../../tools/checkpoint/hybrid_conversion.py).
+[the hybrid conversion script](../../../tools/checkpoint/hybrid_conversion.py).
There is an example run command at the end of that file.
Before running that script, you will need to set `PYTHONPATH` to include the
diff --git a/examples/mamba/run_text_gen_server_8b.sh b/examples/open_models/mamba/run_text_gen_server_8b.sh
similarity index 89%
rename from examples/mamba/run_text_gen_server_8b.sh
rename to examples/open_models/mamba/run_text_gen_server_8b.sh
index 8d3137f2442..ef328fb0063 100755
--- a/examples/mamba/run_text_gen_server_8b.sh
+++ b/examples/open_models/mamba/run_text_gen_server_8b.sh
@@ -1,7 +1,7 @@
#!/bin/bash
# Use: ./run_text_gen_server_8b.sh
-# To launch the client: python ../../tools/text_generation_cli.py
+# To launch the client: python ../../../tools/text_generation_cli.py
CHECKPOINT_PATH=$1
TOKENIZER_PATH=$2
@@ -20,7 +20,7 @@ export NCCL_IB_QPS_PER_CONNECTION=4
export TRITON_CACHE_DIR="./triton-cache/"
export TRITON_CACHE_MANAGER="megatron.core.ssm.triton_cache_manager:ParallelFileCacheManager"
-torchrun $DISTRIBUTED_ARGS ../../tools/run_mamba_text_generation_server.py \
+torchrun $DISTRIBUTED_ARGS ../../../tools/run_mamba_text_generation_server.py \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--untie-embeddings-and-output-weights \
diff --git a/examples/mamba/run_text_gen_server_8b_gpt3.sh b/examples/open_models/mamba/run_text_gen_server_8b_gpt3.sh
similarity index 88%
rename from examples/mamba/run_text_gen_server_8b_gpt3.sh
rename to examples/open_models/mamba/run_text_gen_server_8b_gpt3.sh
index 5413b245ed3..f427a76dba5 100644
--- a/examples/mamba/run_text_gen_server_8b_gpt3.sh
+++ b/examples/open_models/mamba/run_text_gen_server_8b_gpt3.sh
@@ -1,7 +1,7 @@
#!/bin/bash
# Use: ./run_text_gen_server_8b_gpt3.sh
-# To launch the client: python ../../tools/text_generation_cli.py
+# To launch the client: python ../../../tools/text_generation_cli.py
CHECKPOINT_PATH=$1
TOKENIZER_PATH=$2
@@ -17,7 +17,7 @@ export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_IB_TIMEOUT=19
export NCCL_IB_QPS_PER_CONNECTION=4
-torchrun $DISTRIBUTED_ARGS ../../tools/run_text_generation_server.py \
+torchrun $DISTRIBUTED_ARGS ../../../tools/run_text_generation_server.py \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--use-flash-attn \
diff --git a/examples/mamba/train.sh b/examples/open_models/mamba/train.sh
similarity index 97%
rename from examples/mamba/train.sh
rename to examples/open_models/mamba/train.sh
index 3952a997d47..b6f780d09aa 100755
--- a/examples/mamba/train.sh
+++ b/examples/open_models/mamba/train.sh
@@ -102,4 +102,4 @@ options=" \
--no-create-attention-mask-in-dataloader \
--tensorboard-dir ${TENSORBOARD_DIR}"
-torchrun --nproc_per_node 8 ../../pretrain_mamba.py ${options}
+torchrun --nproc_per_node 8 ../../../pretrain_mamba.py ${options}
diff --git a/examples/mimo/__init__.py b/examples/open_models/mimo/__init__.py
similarity index 100%
rename from examples/mimo/__init__.py
rename to examples/open_models/mimo/__init__.py
diff --git a/examples/mimo/avlm_inference.py b/examples/open_models/mimo/avlm_inference.py
similarity index 100%
rename from examples/mimo/avlm_inference.py
rename to examples/open_models/mimo/avlm_inference.py
diff --git a/examples/mimo/configs/llava_avlm.py b/examples/open_models/mimo/configs/llava_avlm.py
similarity index 100%
rename from examples/mimo/configs/llava_avlm.py
rename to examples/open_models/mimo/configs/llava_avlm.py
diff --git a/examples/mimo/configs/llava_vlm.py b/examples/open_models/mimo/configs/llava_vlm.py
similarity index 100%
rename from examples/mimo/configs/llava_vlm.py
rename to examples/open_models/mimo/configs/llava_vlm.py
diff --git a/examples/mimo/configs/mock.py b/examples/open_models/mimo/configs/mock.py
similarity index 100%
rename from examples/mimo/configs/mock.py
rename to examples/open_models/mimo/configs/mock.py
diff --git a/examples/mimo/data/__init__.py b/examples/open_models/mimo/data/__init__.py
similarity index 100%
rename from examples/mimo/data/__init__.py
rename to examples/open_models/mimo/data/__init__.py
diff --git a/examples/mimo/data/avlm_sample_loader.py b/examples/open_models/mimo/data/avlm_sample_loader.py
similarity index 100%
rename from examples/mimo/data/avlm_sample_loader.py
rename to examples/open_models/mimo/data/avlm_sample_loader.py
diff --git a/examples/mimo/data/energon_avlm_task_encoder.py b/examples/open_models/mimo/data/energon_avlm_task_encoder.py
similarity index 100%
rename from examples/mimo/data/energon_avlm_task_encoder.py
rename to examples/open_models/mimo/data/energon_avlm_task_encoder.py
diff --git a/examples/mimo/data/energon_vlm_task_encoder.py b/examples/open_models/mimo/data/energon_vlm_task_encoder.py
similarity index 100%
rename from examples/mimo/data/energon_vlm_task_encoder.py
rename to examples/open_models/mimo/data/energon_vlm_task_encoder.py
diff --git a/examples/mimo/data/mock.py b/examples/open_models/mimo/data/mock.py
similarity index 100%
rename from examples/mimo/data/mock.py
rename to examples/open_models/mimo/data/mock.py
diff --git a/examples/mimo/data/prepare_video_llava_data.py b/examples/open_models/mimo/data/prepare_video_llava_data.py
similarity index 100%
rename from examples/mimo/data/prepare_video_llava_data.py
rename to examples/open_models/mimo/data/prepare_video_llava_data.py
diff --git a/examples/mimo/data/utils/calculate_audio_tokens.py b/examples/open_models/mimo/data/utils/calculate_audio_tokens.py
similarity index 100%
rename from examples/mimo/data/utils/calculate_audio_tokens.py
rename to examples/open_models/mimo/data/utils/calculate_audio_tokens.py
diff --git a/examples/mimo/model_providers/__init__.py b/examples/open_models/mimo/model_providers/__init__.py
similarity index 100%
rename from examples/mimo/model_providers/__init__.py
rename to examples/open_models/mimo/model_providers/__init__.py
diff --git a/examples/mimo/model_providers/hf_clip_encoder.py b/examples/open_models/mimo/model_providers/hf_clip_encoder.py
similarity index 100%
rename from examples/mimo/model_providers/hf_clip_encoder.py
rename to examples/open_models/mimo/model_providers/hf_clip_encoder.py
diff --git a/examples/mimo/model_providers/hf_whisper_encoder.py b/examples/open_models/mimo/model_providers/hf_whisper_encoder.py
similarity index 100%
rename from examples/mimo/model_providers/hf_whisper_encoder.py
rename to examples/open_models/mimo/model_providers/hf_whisper_encoder.py
diff --git a/examples/mimo/model_providers/llava_avlm.py b/examples/open_models/mimo/model_providers/llava_avlm.py
similarity index 100%
rename from examples/mimo/model_providers/llava_avlm.py
rename to examples/open_models/mimo/model_providers/llava_avlm.py
diff --git a/examples/mimo/model_providers/llava_vlm.py b/examples/open_models/mimo/model_providers/llava_vlm.py
similarity index 100%
rename from examples/mimo/model_providers/llava_vlm.py
rename to examples/open_models/mimo/model_providers/llava_vlm.py
diff --git a/examples/mimo/model_providers/mock.py b/examples/open_models/mimo/model_providers/mock.py
similarity index 100%
rename from examples/mimo/model_providers/mock.py
rename to examples/open_models/mimo/model_providers/mock.py
diff --git a/examples/mimo/scripts/run_avlm_train.sh b/examples/open_models/mimo/scripts/run_avlm_train.sh
similarity index 94%
rename from examples/mimo/scripts/run_avlm_train.sh
rename to examples/open_models/mimo/scripts/run_avlm_train.sh
index ced70cd0047..6b5cfbb359f 100755
--- a/examples/mimo/scripts/run_avlm_train.sh
+++ b/examples/open_models/mimo/scripts/run_avlm_train.sh
@@ -116,7 +116,7 @@ AUDIO_MODEL_ARGS=(
if [ "$DEBUG_MODE" = true ]; then
echo "Running in debug mode with $GPUS_PER_NODE GPU(s) per node..."
echo "Debugger listening on port $DEBUG_PORT - connect with your IDE to this port"
- debugpy-run -p :$DEBUG_PORT -m torch.distributed.run -- ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \
+ debugpy-run -p :$DEBUG_PORT -m torch.distributed.run -- ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \
${TRAINING_ARGS[@]} \
${MODEL_PARALLEL_ARGS[@]} \
${EVAL_AND_LOGGING_ARGS[@]} \
@@ -128,7 +128,7 @@ else
echo "Running in normal mode with $GPUS_PER_NODE GPU(s) per node..."
if [ "$DRY_RUN" = true ]; then
echo "Dry run mode enabled"
- echo "torchrun ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \
+ echo "torchrun ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \
${TRAINING_ARGS[@]} \
${MODEL_PARALLEL_ARGS[@]} \
${EVAL_AND_LOGGING_ARGS[@]} \
@@ -137,7 +137,7 @@ else
${AUDIO_MODEL_ARGS[@]} \
${DATASET_ARGS[@]}"
else
- torchrun ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \
+ torchrun ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \
${TRAINING_ARGS[@]} \
${MODEL_PARALLEL_ARGS[@]} \
${EVAL_AND_LOGGING_ARGS[@]} \
diff --git a/examples/mimo/scripts/run_mock_train.sh b/examples/open_models/mimo/scripts/run_mock_train.sh
similarity index 91%
rename from examples/mimo/scripts/run_mock_train.sh
rename to examples/open_models/mimo/scripts/run_mock_train.sh
index 2ed71cd5ede..5970fe60767 100755
--- a/examples/mimo/scripts/run_mock_train.sh
+++ b/examples/open_models/mimo/scripts/run_mock_train.sh
@@ -1,7 +1,7 @@
#!/bin/bash
# from the root of the repo
-# ./examples/mimo/scripts/run_mock_train.sh
+# ./examples/open_models/mimo/scripts/run_mock_train.sh
export CUDA_DEVICE_MAX_CONNECTIONS=1
DRY_RUN=false
@@ -82,7 +82,7 @@ GPT_MODEL_ARGS=(
if [ "$DEBUG_MODE" = true ]; then
echo "Running in debug mode with $GPUS_PER_NODE GPU(s) per node..."
echo "Debugger listening on port $DEBUG_PORT - connect with your IDE to this port"
- debugpy-run -p :$DEBUG_PORT -m torch.distributed.run -- ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \
+ debugpy-run -p :$DEBUG_PORT -m torch.distributed.run -- ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \
${TRAINING_ARGS[@]} \
${MODEL_PARALLEL_ARGS[@]} \
${EVAL_AND_LOGGING_ARGS[@]} \
@@ -92,14 +92,14 @@ else
echo "Running in normal mode with $GPUS_PER_NODE GPU(s) per node..."
if [ "$DRY_RUN" = true ]; then
echo "Dry run mode enabled"
- echo "torchrun ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \
+ echo "torchrun ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \
${TRAINING_ARGS[@]} \
${MODEL_PARALLEL_ARGS[@]} \
${EVAL_AND_LOGGING_ARGS[@]} \
${TOKENIZER_ARGS[@]} \
${GPT_MODEL_ARGS[@]}"
else
- torchrun ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \
+ torchrun ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \
${TRAINING_ARGS[@]} \
${MODEL_PARALLEL_ARGS[@]} \
${EVAL_AND_LOGGING_ARGS[@]} \
diff --git a/examples/mimo/scripts/run_video_vlm_train.sh b/examples/open_models/mimo/scripts/run_video_vlm_train.sh
similarity index 94%
rename from examples/mimo/scripts/run_video_vlm_train.sh
rename to examples/open_models/mimo/scripts/run_video_vlm_train.sh
index 2ec8af9d55f..ff63c1fc495 100755
--- a/examples/mimo/scripts/run_video_vlm_train.sh
+++ b/examples/open_models/mimo/scripts/run_video_vlm_train.sh
@@ -110,7 +110,7 @@ GPT_MODEL_ARGS=(
if [ "$DEBUG_MODE" = true ]; then
echo "Running in debug mode with $GPUS_PER_NODE GPU(s) per node..."
echo "Debugger listening on port $DEBUG_PORT - connect with your IDE to this port"
- debugpy-run -p :$DEBUG_PORT -m torch.distributed.run -- ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \
+ debugpy-run -p :$DEBUG_PORT -m torch.distributed.run -- ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \
${TRAINING_ARGS[@]} \
${MODEL_PARALLEL_ARGS[@]} \
${EVAL_AND_LOGGING_ARGS[@]} \
@@ -121,7 +121,7 @@ else
echo "Running in normal mode with $GPUS_PER_NODE GPU(s) per node..."
if [ "$DRY_RUN" = true ]; then
echo "Dry run mode enabled"
- echo "torchrun ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \
+ echo "torchrun ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \
${TRAINING_ARGS[@]} \
${MODEL_PARALLEL_ARGS[@]} \
${EVAL_AND_LOGGING_ARGS[@]} \
@@ -129,7 +129,7 @@ else
${GPT_MODEL_ARGS[@]} \
${DATASET_ARGS[@]}"
else
- torchrun ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \
+ torchrun ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \
${TRAINING_ARGS[@]} \
${MODEL_PARALLEL_ARGS[@]} \
${EVAL_AND_LOGGING_ARGS[@]} \
diff --git a/examples/mimo/scripts/run_vlm_train.sh b/examples/open_models/mimo/scripts/run_vlm_train.sh
similarity index 94%
rename from examples/mimo/scripts/run_vlm_train.sh
rename to examples/open_models/mimo/scripts/run_vlm_train.sh
index 702b29795a4..bc95f9cc6c1 100755
--- a/examples/mimo/scripts/run_vlm_train.sh
+++ b/examples/open_models/mimo/scripts/run_vlm_train.sh
@@ -111,7 +111,7 @@ GPT_MODEL_ARGS=(
if [ "$DEBUG_MODE" = true ]; then
echo "Running in debug mode with $GPUS_PER_NODE GPU(s) per node..."
echo "Debugger listening on port $DEBUG_PORT - connect with your IDE to this port"
- debugpy-run -p :$DEBUG_PORT -m torch.distributed.run -- ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \
+ debugpy-run -p :$DEBUG_PORT -m torch.distributed.run -- ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \
${TRAINING_ARGS[@]} \
${MODEL_PARALLEL_ARGS[@]} \
${EVAL_AND_LOGGING_ARGS[@]} \
@@ -122,7 +122,7 @@ else
echo "Running in normal mode with $GPUS_PER_NODE GPU(s) per node..."
if [ "$DRY_RUN" = true ]; then
echo "Dry run mode enabled"
- echo "torchrun ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \
+ echo "torchrun ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \
${TRAINING_ARGS[@]} \
${MODEL_PARALLEL_ARGS[@]} \
${EVAL_AND_LOGGING_ARGS[@]} \
@@ -130,7 +130,7 @@ else
${GPT_MODEL_ARGS[@]} \
${DATASET_ARGS[@]}"
else
- torchrun ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \
+ torchrun ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \
${TRAINING_ARGS[@]} \
${MODEL_PARALLEL_ARGS[@]} \
${EVAL_AND_LOGGING_ARGS[@]} \
diff --git a/examples/mimo/train.py b/examples/open_models/mimo/train.py
similarity index 100%
rename from examples/mimo/train.py
rename to examples/open_models/mimo/train.py
diff --git a/examples/mimo/utils/__init__.py b/examples/open_models/mimo/utils/__init__.py
similarity index 100%
rename from examples/mimo/utils/__init__.py
rename to examples/open_models/mimo/utils/__init__.py
diff --git a/examples/mimo/utils/data_helpers.py b/examples/open_models/mimo/utils/data_helpers.py
similarity index 100%
rename from examples/mimo/utils/data_helpers.py
rename to examples/open_models/mimo/utils/data_helpers.py
diff --git a/examples/mimo/utils/logging.py b/examples/open_models/mimo/utils/logging.py
similarity index 100%
rename from examples/mimo/utils/logging.py
rename to examples/open_models/mimo/utils/logging.py
diff --git a/examples/mimo/utils/model_helpers.py b/examples/open_models/mimo/utils/model_helpers.py
similarity index 100%
rename from examples/mimo/utils/model_helpers.py
rename to examples/open_models/mimo/utils/model_helpers.py
diff --git a/examples/t5/README.md b/examples/open_models/t5/README.md
similarity index 93%
rename from examples/t5/README.md
rename to examples/open_models/t5/README.md
index 205da1db370..8fcf7b96c59 100644
--- a/examples/t5/README.md
+++ b/examples/open_models/t5/README.md
@@ -21,7 +21,7 @@ DATA_PATH="" #_text_document
srun -N $NUM_NODES --container-image $PYTORCH_IMAGE --container-mounts "/path/to/data:/path/to/data,/path/to/megatron-lm:/workspace/megatron-lm" --account $ACCOUNT -N 1 -J $JOB_NAME -p $PARTITION --no-container-mount-home -c "
cd /workspace/megatron-lm
- ./examples/t5/train_t5_220m_distributed.sh $CHECKPOINT_PATH $TENSORBOARD_LOGS_PATH $VOCAB_FILE $DATA_PATH"
+ ./examples/open_models/t5/train_t5_220m_distributed.sh $CHECKPOINT_PATH $TENSORBOARD_LOGS_PATH $VOCAB_FILE $DATA_PATH"
```
diff --git a/examples/t5/t5_mcore_train_curve.png b/examples/open_models/t5/t5_mcore_train_curve.png
similarity index 100%
rename from examples/t5/t5_mcore_train_curve.png
rename to examples/open_models/t5/t5_mcore_train_curve.png
diff --git a/examples/t5/train_t5_220m_distributed.sh b/examples/open_models/t5/train_t5_220m_distributed.sh
similarity index 100%
rename from examples/t5/train_t5_220m_distributed.sh
rename to examples/open_models/t5/train_t5_220m_distributed.sh
diff --git a/megatron/core/README.md b/megatron/core/README.md
index a9134be41cd..4fae09d9909 100644
--- a/megatron/core/README.md
+++ b/megatron/core/README.md
@@ -40,7 +40,7 @@ torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py
- **[Simple Training Loop](https://github.com/NVIDIA/Megatron-LM/blob/main/examples/run_simple_mcore_train_loop.py)** - Basic usage
- **[Multimodal Training](https://github.com/NVIDIA/Megatron-LM/blob/main/examples/multimodal/)** - Vision-language models
- **[Mixture-of-Experts](https://github.com/yanring/Megatron-MoE-ModelZoo)** - MoE examples
-- **[Mamba Models](https://github.com/NVIDIA/Megatron-LM/blob/main/examples/mamba/)** - State-space models
+- **[Mamba Models](https://github.com/NVIDIA/Megatron-LM/blob/main/examples/open_models/mamba/)** - State-space models
**Documentation:**
- **[📚 API Guide](https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/index.html)** - Complete API documentation
diff --git a/tests/test_utils/recipes/h100/mimo.yaml b/tests/test_utils/recipes/h100/mimo.yaml
index 88b17815ede..5614f690682 100644
--- a/tests/test_utils/recipes/h100/mimo.yaml
+++ b/tests/test_utils/recipes/h100/mimo.yaml
@@ -50,7 +50,7 @@ spec:
"TENSORBOARD_PATH={assets_dir}/tensorboard"
"CHECKPOINT_SAVE_PATH={artifacts_dir}/checkpoints"
"CHECKPOINT_LOAD_PATH=/mnt/artifacts"
- "TRAINING_SCRIPT_PATH=./examples/mimo/train.py"
+ "TRAINING_SCRIPT_PATH=./examples/open_models/mimo/train.py"
"TRAINING_PARAMS_PATH=./tests/functional_tests/test_cases/{model}/{test_case}/model_config.yaml"
"GOLDEN_VALUES_PATH=./tests/functional_tests/test_cases/{model}/{test_case}/golden_values_{environment}_{platforms}.json"
"N_REPEAT={n_repeat}"