diff --git a/docs/get-started/quickstart.md b/docs/get-started/quickstart.md index 61868e7877c..28514e3b2a3 100644 --- a/docs/get-started/quickstart.md +++ b/docs/get-started/quickstart.md @@ -33,7 +33,7 @@ torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py ```bash # 8 GPUs, FP8 precision, mock data -./examples/llama/train_llama3_8b_fp8.sh +./examples/open_models/llama/train_llama3_8b_fp8.sh ``` ## Data Preparation diff --git a/docs/models/llms.md b/docs/models/llms.md index 6789a4c551c..3402c522713 100644 --- a/docs/models/llms.md +++ b/docs/models/llms.md @@ -34,12 +34,10 @@ See the [Megatron Bridge supported models list](https://github.com/NVIDIA-NeMo/M ## Example Scripts Training examples for these models can be found in the `examples/` directory: -- `examples/gpt3/` - GPT-3 training scripts -- `examples/llama/` - LLaMA training scripts -- `examples/mixtral/` - Mixtral MoE training -- `examples/mamba/` - Mamba training scripts -- `examples/bert/` - BERT training scripts -- `examples/t5/` - T5 training scripts +- `examples/open_models/gpt3/` - GPT-3 training scripts +- `examples/open_models/llama/` - LLaMA training scripts +- `examples/open_models/mamba/` - Mamba training scripts +- `examples/open_models/t5/` - T5 training scripts ## Model Implementation diff --git a/docs/models/multimodal.md b/docs/models/multimodal.md index 66ed8ccd9cb..11475495894 100644 --- a/docs/models/multimodal.md +++ b/docs/models/multimodal.md @@ -14,7 +14,7 @@ Megatron Core supports multimodal models that combine language with vision, audi - Unified embedding space across modalities - Support for both vision-language and audio-vision-language models -See [examples/mimo](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/mimo) for training scripts and examples. +See [examples/open_models/mimo](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/open_models/mimo) for training scripts and examples. 
## Vision-Language Models @@ -52,7 +52,7 @@ For multimodal diffusion models (image generation, text-to-image, etc.), see [Ne Multimodal training examples can be found in the following directories: **MIMO Framework:** -- `examples/mimo/` - Multimodal In/Out training with support for vision-language and audio-vision-language models +- `examples/open_models/mimo/` - Multimodal In/Out training with support for vision-language and audio-vision-language models **Specific Multimodal Models:** - `examples/multimodal/` - LLaVA-style training with Mistral + CLIP diff --git a/docs/user-guide/training-examples.md b/docs/user-guide/training-examples.md index 2824c608c36..2295fd16b40 100644 --- a/docs/user-guide/training-examples.md +++ b/docs/user-guide/training-examples.md @@ -24,7 +24,7 @@ This example: Train LLaMA-3 8B model with FP8 mixed precision on 8 GPUs: ```bash -./examples/llama/train_llama3_8b_fp8.sh +./examples/open_models/llama/train_llama3_8b_fp8.sh ``` **Configuration:** diff --git a/examples/bert/README.md b/examples/bert/README.md deleted file mode 100644 index 6c1fe95bf06..00000000000 --- a/examples/bert/README.md +++ /dev/null @@ -1,53 +0,0 @@ -# BERT MODEL - -## Table of contents -- [1. Training Setup](#1-training-setup) -- [2. Configurations](#2-configurations) - -## 1. Training setup - - -To run the model using a docker container run it as follows -``` -PYTORCH_IMAGE=nvcr.io/nvidia/pytorch:24.01-py3 -CHECKPOINT_PATH="" # -TENSORBOARD_LOGS_PATH=""# -VOCAB_FILE="" #//bert-vocab.txt -DATA_PATH="" #_text_document - -docker run \ - --gpus=all \ - --ipc=host \ - --workdir /workspace/megatron-lm \ - -v /path/to/data:/path/to/data \ - -v /path/to/megatron-lm:/workspace/megatron-lm \ - megatron-lm nvcr.io/nvidia/pytorch:24.01-py3 \ - bash examples/bert/train_bert_340m_distributed.sh $CHECKPOINT_PATH $TENSORBOARD_LOGS_PATH $VOCAB_FILE $DATA_PATH " - -``` -NOTE: Depending on the environment you are running it the above command might like slightly different. 
- - -## 2. Configurations - -The example in this folder shows you how to run 340m large model. There are other configs you could run as well - -### 4B -``` - --num-layers 48 \ - --hidden-size 2560 \ - --num-attention-heads 32 \ - --tensor-model-parallel-size 1 \ - --pipeline-model-parallel-size 1 \ - -``` - -### 20B -``` - --num-layers 48 \ - --hidden-size 6144 \ - --num-attention-heads 96 \ - --tensor-model-parallel-size 4 \ - --pipeline-model-parallel-size 4 \ - -``` \ No newline at end of file diff --git a/examples/bert/train_bert_340m_distributed.sh b/examples/bert/train_bert_340m_distributed.sh deleted file mode 100644 index f0d9c87c8bf..00000000000 --- a/examples/bert/train_bert_340m_distributed.sh +++ /dev/null @@ -1,79 +0,0 @@ -#!/bin/bash - -# Runs the "340M" parameter model (Bert - Large) - -export CUDA_DEVICE_MAX_CONNECTIONS=1 - -GPUS_PER_NODE=8 -# Change for multinode config -MASTER_ADDR=localhost -MASTER_PORT=6000 -NUM_NODES=1 -NODE_RANK=0 -WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES)) - -CHECKPOINT_PATH=$1 # -TENSORBOARD_LOGS_PATH=$2 # -VOCAB_FILE=$3 #/bert-vocab.json -DATA_PATH=$4 #_text_document - -DISTRIBUTED_ARGS=( - --nproc_per_node $GPUS_PER_NODE - --nnodes $NUM_NODES - --master_addr $MASTER_ADDR - --master_port $MASTER_PORT -) - -BERT_MODEL_ARGS=( - --num-layers 24 - --hidden-size 1024 - --num-attention-heads 16 - --seq-length 512 - --max-position-embeddings 512 - --attention-backend auto # Can use (flash/fused/unfused/local) -) - -TRAINING_ARGS=( - --micro-batch-size 4 - --global-batch-size 32 - --train-iters 1000000 - --weight-decay 1e-2 - --clip-grad 1.0 - --fp16 - --lr 0.0001 - --lr-decay-iters 990000 - --lr-decay-style linear - --min-lr 1.0e-5 - --weight-decay 1e-2 - --lr-warmup-fraction .01 - --clip-grad 1.0 -) - -MODEL_PARALLEL_ARGS=( - --tensor-model-parallel-size 8 - --pipeline-model-parallel-size 16 -) - -DATA_ARGS=( - --data-path $DATA_PATH - --vocab-file $VOCAB_FILE - --split 949,50,1 -) - -EVAL_AND_LOGGING_ARGS=( - --log-interval 100 - 
--save-interval 10000 - --eval-interval 1000 - --save $CHECKPOINT_PATH - --load $CHECKPOINT_PATH - --eval-iters 10 - --tensorboard-dir $TENSORBOARD_LOGS_PATH -) - -torchrun ${DISTRIBUTED_ARGS[@]} pretrain_bert.py \ - ${BERT_MODEL_ARGS[@]} \ - ${TRAINING_ARGS[@]} \ - ${MODEL_PARALLEL_ARGS[@]} \ - ${DATA_ARGS[@]} \ - ${EVAL_AND_LOGGING_ARGS[@]} - \ No newline at end of file diff --git a/examples/mixtral/README.md b/examples/mixtral/README.md deleted file mode 100644 index e85eccd6efd..00000000000 --- a/examples/mixtral/README.md +++ /dev/null @@ -1,132 +0,0 @@ -# Mixtral 8x7B Model Inference and Finetuning - -## Download Mixtral 8x7B Checkpoints -Download Mixtral 8x7B HF format checkpoint from [HF-hub](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/) - -Or you can simply run this following script to download Mixtral 8x7B into a specific folder. -```python -from huggingface_hub import snapshot_download -SAVED_DIR = "" # Specify the saved directory -# Download HF checkpoints -snapshot_download(repo_id="mistralai/Mixtral-8x7B-v0.1", ignore_patterns=["*.pt"], local_dir=SAVED_DIR, local_dir_use_symlinks=False) -``` - -## Convert Mixtral 8x7B checkpoints from HF to MCore -The HF checkpoints can be converted to Megatron format by using the provided checkpoint converter for HF format. -The target model parallel size(e.g. TP,PP,EP) should be specified. - -Currently the converter doesn't support distributed checkpointing yet, so each different parallel config requires a specific checkpoint. 
-- For training, the recommended model parallel config is TP1EP8PP4 -- For inference, the recommended model parallel config is TP1EP1PP2 - -``` -TOKENIZER_MODEL=/workspace/checkpoints/mixtral-hf/tokenizer.model -MEGATRON_PATH="/workspace/megatron-lm" -export PYTHONPATH=$MEGATRON_PATH:$PYTHONPATH -export CUDA_DEVICE_MAX_CONNECTIONS=1 - -TARGET_TP_SIZE="" -TARGET_EP_SIZE="" -TARGET_PP_SIZE="" - -HF_FORMAT_DIR=/workspace/checkpoints/mixtral-hf -MEGATRON_FORMAT_DIR=/workspace/checkpoints/mixtral-mcore-TP${TARGET_TP_SIZE}PP${TARGET_PP_SIZE}EP${TARGET_EP_SIZE} - -python tools/checkpoint/convert.py \ ---model-type GPT \ ---loader loader_mixtral_hf \ ---saver mcore \ ---target-tensor-parallel-size ${TARGET_TP_SIZE} \ ---target-pipeline-parallel-size ${TARGET_PP_SIZE} \ ---target-expert-parallel-size ${TARGET_EP_SIZE} \ ---load-dir ${HF_FORMAT_DIR} \ ---save-dir ${MEGATRON_FORMAT_DIR} \ ---tokenizer-model ${TOKENIZER_MODEL} -``` - -## Text generation with Mixtral 8x7B -Inference with Mixtral 8x7B requires at least 2 GPUS, such that a distributed checkpoint with EP>=2 or PP>=2 converted with above script is needed. - -The Megatron-LM have included a simple REST server to use for text generation in `tools/run_text_generation_server.py`, launch it with the following script: -``` -#!/bin/bash -# This example will start serving the Mixtral 8x7B model. 
-DISTRIBUTED_ARGS="--nproc_per_node 2 \ - --nnodes 1 \ - --node_rank 0 \ - --master_addr localhost \ - --master_port 6000" - -CHECKPOINT= -TOKENIZER_MODEL= - -export CUDA_DEVICE_MAX_CONNECTIONS=1 - -pip install flask-restful - -torchrun $DISTRIBUTED_ARGS tools/run_text_generation_server.py \ - --tensor-model-parallel-size 1 \ - --pipeline-model-parallel-size 2 \ - --expert-model-parallel-size 1 \ - --load ${CHECKPOINT} \ - --tokenizer-type Llama2Tokenizer \ - --tokenizer-model $TOKENIZER_MODEL \ - --use-mcore-models \ - --max-position-embeddings 32768 \ - --num-layers 32 \ - --hidden-size 4096 \ - --ffn-hidden-size 14336 \ - --num-attention-heads 32 \ - --normalization RMSNorm \ - --disable-bias-linear \ - --position-embedding-type rope \ - --no-position-embedding \ - --swiglu \ - --untie-embeddings-and-output-weights \ - --group-query-attention \ - --num-query-groups 8 \ - --bf16 \ - --micro-batch-size 1 \ - --seq-length 1024 \ - --seed 42 \ - --num-experts 8 \ - --moe-router-topk 2 \ - --moe-token-dispatcher-type alltoall \ - --moe-grouped-gemm \ - --mock-data \ - --rotary-base 1000000 -``` - -Once the server is running you can use `tools/text_generation_cli.py` to query it, it takes one argument which is the host the server is running on. 
- -``` -python tools/text_generation_cli.py localhost:5000 -``` - - -## Finetuning from pretrained Mixtral 8x7B -To finetuning pretrained Mixtral 8x7B, use the following scripts: - - -```bash -PYTORCH_IMAGE=nvcr.io/nvidia/pytorch:24.04-py3 -CHECKPOINT_PATH="" # Speicfy path to checkpoint dir -TOKENIZER_MODEL="" # Specify path to tokenizer.model -DATA_PATH="" # Specify path to data - -docker run \ - --gpus=all \ - --ipc=host \ - --workdir /workspace/megatron-lm \ - -v /path/to/data:/path/to/data \ - -v /path/to/megatron-lm:/workspace/megatron-lm \ - $PYTORCH_IMAGE \ - bash examples/mixtral/train_mixtral_8x7b_distributed.sh $CHECKPOINT_PATH $TOKENIZER_MODEL $DATA_PATH -``` - -The above functionality also applys to Mixtral 8x22B actually, you should set the model config (including hidden_size/head_num/num_layers/ffn_hidden_size) properly according to the original [config](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1/blob/main/config.json). - -## Acknowledgements -Contributors outside NVIDIA for the huggingface converter and example of Mixtral models in Megatron-Core: -- Peng Li -- Jun Huang diff --git a/examples/mixtral/train_mixtral_8x7b_distributed.sh b/examples/mixtral/train_mixtral_8x7b_distributed.sh deleted file mode 100644 index ed44d60f5c0..00000000000 --- a/examples/mixtral/train_mixtral_8x7b_distributed.sh +++ /dev/null @@ -1,116 +0,0 @@ -#!/bin/bash - -# Runs Mixtral 8x7B model - -export CUDA_DEVICE_MAX_CONNECTIONS=1 - -GPUS_PER_NODE=8 -# Change for multinode config -MASTER_ADDR=${MASTER_ADDR:-"localhost"} -MASTER_PORT=${MASTER_PORT:-"6000"} -NNODES=${SLURM_NNODES:-"1"} -NODE_RANK=${RANK:-"0"} -WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES)) - -CHECKPOINT_PATH=$1 -TOKENIZER_MODEL=$2 -DATA_PATH=$3 - -DISTRIBUTED_ARGS=( - --nproc_per_node $GPUS_PER_NODE - --nnodes $NNODES - --node_rank $NODE_RANK - --master_addr $MASTER_ADDR - --master_port $MASTER_PORT -) - -MODEL_ARGS=( - --use-mcore-models - --disable-bias-linear - --seq-length 4096 - 
--max-position-embeddings 32768 - --num-layers 32 - --hidden-size 4096 - --ffn-hidden-size 14336 - --num-attention-heads 32 - --init-method-std 0.01 - --attention-dropout 0.0 - --hidden-dropout 0.0 - --normalization RMSNorm - --position-embedding-type rope - --swiglu - --untie-embeddings-and-output-weights - --group-query-attention - --num-query-groups 8 - --no-masked-softmax-fusion - --no-position-embedding - --rotary-base 1000000 -) - -MOE_ARGS=( - --num-experts 8 - --moe-router-topk 2 - --moe-router-load-balancing-type aux_loss - --moe-aux-loss-coeff 1e-2 - --moe-grouped-gemm - --moe-token-dispatcher-type alltoall - --overlap-param-gather - --overlap-grad-reduce -) - -DATA_ARGS=( - --tokenizer-type Llama2Tokenizer - --tokenizer-model ${TOKENIZER_MODEL} - --data-path $DATA_PATH - --split 99990,8,2 -) - -TRAINING_ARGS=( - --micro-batch-size 1 - --global-batch-size 256 - --lr 1e-4 - --train-iters 500000 - --lr-decay-iters 320000 - --lr-decay-style cosine - --min-lr 1.0e-5 - --weight-decay 0.1 - --lr-warmup-iters 500 - --clip-grad 1.0 - --bf16 -) - -MODEL_PARALLEL_ARGS=( - --tensor-model-parallel-size 1 - --pipeline-model-parallel-size 4 - --expert-model-parallel-size 8 - --use-distributed-optimizer - --sequence-parallel -) - -LOGGING_ARGS=( - --log-interval 1 \ - --save-interval 10000 \ - --eval-interval 1000 \ - --eval-iters 10 \ - --save $CHECKPOINT_PATH \ - --load $CHECKPOINT_PATH \ - --tensorboard-dir "${CHECKPOINT_PATH}/tensorboard" \ - --no-load-optim \ - --no-load-rng -) - -if [ -n "${WANDB_API_KEY}" ]; then - LOGGING_ARGS+=( - --wandb-project ${WANDB_PROJECT:-"Mixtral"} - --wandb-exp-name ${WANDB_NAME:-"Mixtral_8x7B"} - ) -fi - - -torchrun ${DISTRIBUTED_ARGS[@]} pretrain_gpt.py \ - ${MODEL_ARGS[@]} \ - ${MOE_ARGS[@]} \ - ${DATA_ARGS[@]} \ - ${TRAINING_ARGS[@]} \ - ${MODEL_PARALLEL_ARGS[@]} \ - ${LOGGING_ARGS[@]} diff --git a/examples/gpt3/README.md b/examples/open_models/gpt3/README.md similarity index 91% rename from examples/gpt3/README.md rename to 
examples/open_models/gpt3/README.md index 8d6f2674163..4bf11cdb7e1 100644 --- a/examples/gpt3/README.md +++ b/examples/open_models/gpt3/README.md @@ -24,7 +24,7 @@ docker run \ -v /path/to/data:/path/to/data \ -v /path/to/megatron-lm:/workspace/megatron-lm \ megatron-lm nvcr.io/nvidia/pytorch:24.01-py3 \ - bash examples/gpt3/train_gpt3_175b_distributed.sh $CHECKPOINT_PATH $TENSORBOARD_LOGS_PATH $VOCAB_FILE $MERGE_FILE $DATA_PATH " + bash examples/open_models/gpt3/train_gpt3_175b_distributed.sh $CHECKPOINT_PATH $TENSORBOARD_LOGS_PATH $VOCAB_FILE $MERGE_FILE $DATA_PATH " ``` NOTE: Depending on the environment you are running in, the above command might look slightly different. diff --git a/examples/gpt3/gpt_config.yaml b/examples/open_models/gpt3/gpt_config.yaml similarity index 100% rename from examples/gpt3/gpt_config.yaml rename to examples/open_models/gpt3/gpt_config.yaml diff --git a/examples/gpt3/train_gpt3_175b_distributed.sh b/examples/open_models/gpt3/train_gpt3_175b_distributed.sh similarity index 100% rename from examples/gpt3/train_gpt3_175b_distributed.sh rename to examples/open_models/gpt3/train_gpt3_175b_distributed.sh diff --git a/examples/llama/README.md b/examples/open_models/llama/README.md similarity index 97% rename from examples/llama/README.md rename to examples/open_models/llama/README.md index 9872185ab2f..c4470cd4812 100644 --- a/examples/llama/README.md +++ b/examples/open_models/llama/README.md @@ -45,7 +45,7 @@ docker run --rm --gpus all --ipc=host --ulimit memlock=-1 \ -v "${HOST_TENSORBOARD_LOGS_PATH}:/workspace/tensorboard_logs" \ --workdir /workspace/megatron-lm \ $PYTORCH_IMAGE \ - bash examples/llama/train_llama3_8b_h100_fp8.sh \ + bash examples/open_models/llama/train_llama3_8b_h100_fp8.sh \ /workspace/checkpoints \ /workspace/tensorboard_logs \ 2>&1 | tee "${HOST_TENSORBOARD_LOGS_PATH}/training_mock_$(date +'%y-%m-%d_%H-%M-%S').log" @@ -63,7 +63,7 @@ docker run --rm --gpus all --ipc=host --ulimit memlock=-1 \ -v "$(dirname
"${HOST_DATA_PREFIX}"):/workspace/data_dir" \ --workdir /workspace/megatron-lm \ $PYTORCH_IMAGE \ - bash examples/llama/train_llama3_8b_h100_fp8.sh \ + bash examples/open_models/llama/train_llama3_8b_h100_fp8.sh \ /workspace/checkpoints \ /workspace/tensorboard_logs \ /workspace/tokenizer_model \ diff --git a/examples/llama/train_llama3_8b_h100_fp8.sh b/examples/open_models/llama/train_llama3_8b_h100_fp8.sh similarity index 100% rename from examples/llama/train_llama3_8b_h100_fp8.sh rename to examples/open_models/llama/train_llama3_8b_h100_fp8.sh diff --git a/examples/mamba/.gitignore b/examples/open_models/mamba/.gitignore similarity index 100% rename from examples/mamba/.gitignore rename to examples/open_models/mamba/.gitignore diff --git a/examples/mamba/Dockerfile b/examples/open_models/mamba/Dockerfile similarity index 100% rename from examples/mamba/Dockerfile rename to examples/open_models/mamba/Dockerfile diff --git a/examples/mamba/README.md b/examples/open_models/mamba/README.md similarity index 96% rename from examples/mamba/README.md rename to examples/open_models/mamba/README.md index f8f6d796837..f26a3de8dfe 100644 --- a/examples/mamba/README.md +++ b/examples/open_models/mamba/README.md @@ -22,7 +22,7 @@ docker run --gpus all -it --rm \ -v /path/to/megatron:/workspace/megatron \ -v /path/to/dataset:/workspace/dataset \ -v /path/to/checkpoints:/workspace/checkpoints \ - -w /workspace/megatron/examples/mamba \ + -w /workspace/megatron/examples/open_models/mamba \ your_image_name:your_tag ``` @@ -55,7 +55,7 @@ including the hybrid layer configuration and model parallel configuration. If you need to convert a hybrid checkpoint file to a different tensor parallel or pipeline parallel size, use -[the hybrid conversion script](../../tools/checkpoint/hybrid_conversion.py). +[the hybrid conversion script](../../../tools/checkpoint/hybrid_conversion.py). There is an example run command at the end of that file. 
Before running that script, you will need to set `PYTHONPATH` to include the diff --git a/examples/mamba/run_text_gen_server_8b.sh b/examples/open_models/mamba/run_text_gen_server_8b.sh similarity index 89% rename from examples/mamba/run_text_gen_server_8b.sh rename to examples/open_models/mamba/run_text_gen_server_8b.sh index 8d3137f2442..ef328fb0063 100755 --- a/examples/mamba/run_text_gen_server_8b.sh +++ b/examples/open_models/mamba/run_text_gen_server_8b.sh @@ -1,7 +1,7 @@ #!/bin/bash # Use: ./run_text_gen_server_8b.sh -# To launch the client: python ../../tools/text_generation_cli.py +# To launch the client: python ../../../tools/text_generation_cli.py CHECKPOINT_PATH=$1 TOKENIZER_PATH=$2 @@ -20,7 +20,7 @@ export NCCL_IB_QPS_PER_CONNECTION=4 export TRITON_CACHE_DIR="./triton-cache/" export TRITON_CACHE_MANAGER="megatron.core.ssm.triton_cache_manager:ParallelFileCacheManager" -torchrun $DISTRIBUTED_ARGS ../../tools/run_mamba_text_generation_server.py \ +torchrun $DISTRIBUTED_ARGS ../../../tools/run_mamba_text_generation_server.py \ --tensor-model-parallel-size 1 \ --pipeline-model-parallel-size 1 \ --untie-embeddings-and-output-weights \ diff --git a/examples/mamba/run_text_gen_server_8b_gpt3.sh b/examples/open_models/mamba/run_text_gen_server_8b_gpt3.sh similarity index 88% rename from examples/mamba/run_text_gen_server_8b_gpt3.sh rename to examples/open_models/mamba/run_text_gen_server_8b_gpt3.sh index 5413b245ed3..f427a76dba5 100644 --- a/examples/mamba/run_text_gen_server_8b_gpt3.sh +++ b/examples/open_models/mamba/run_text_gen_server_8b_gpt3.sh @@ -1,7 +1,7 @@ #!/bin/bash # Use: ./run_text_gen_server_8b_gpt3.sh -# To launch the client: python ../../tools/text_generation_cli.py +# To launch the client: python ../../../tools/text_generation_cli.py CHECKPOINT_PATH=$1 TOKENIZER_PATH=$2 @@ -17,7 +17,7 @@ export CUDA_DEVICE_MAX_CONNECTIONS=1 export NCCL_IB_TIMEOUT=19 export NCCL_IB_QPS_PER_CONNECTION=4 -torchrun $DISTRIBUTED_ARGS 
../../tools/run_text_generation_server.py \ +torchrun $DISTRIBUTED_ARGS ../../../tools/run_text_generation_server.py \ --tensor-model-parallel-size 1 \ --pipeline-model-parallel-size 1 \ --use-flash-attn \ diff --git a/examples/mamba/train.sh b/examples/open_models/mamba/train.sh similarity index 97% rename from examples/mamba/train.sh rename to examples/open_models/mamba/train.sh index 3952a997d47..b6f780d09aa 100755 --- a/examples/mamba/train.sh +++ b/examples/open_models/mamba/train.sh @@ -102,4 +102,4 @@ options=" \ --no-create-attention-mask-in-dataloader \ --tensorboard-dir ${TENSORBOARD_DIR}" -torchrun --nproc_per_node 8 ../../pretrain_mamba.py ${options} +torchrun --nproc_per_node 8 ../../../pretrain_mamba.py ${options} diff --git a/examples/mimo/__init__.py b/examples/open_models/mimo/__init__.py similarity index 100% rename from examples/mimo/__init__.py rename to examples/open_models/mimo/__init__.py diff --git a/examples/mimo/avlm_inference.py b/examples/open_models/mimo/avlm_inference.py similarity index 100% rename from examples/mimo/avlm_inference.py rename to examples/open_models/mimo/avlm_inference.py diff --git a/examples/mimo/configs/llava_avlm.py b/examples/open_models/mimo/configs/llava_avlm.py similarity index 100% rename from examples/mimo/configs/llava_avlm.py rename to examples/open_models/mimo/configs/llava_avlm.py diff --git a/examples/mimo/configs/llava_vlm.py b/examples/open_models/mimo/configs/llava_vlm.py similarity index 100% rename from examples/mimo/configs/llava_vlm.py rename to examples/open_models/mimo/configs/llava_vlm.py diff --git a/examples/mimo/configs/mock.py b/examples/open_models/mimo/configs/mock.py similarity index 100% rename from examples/mimo/configs/mock.py rename to examples/open_models/mimo/configs/mock.py diff --git a/examples/mimo/data/__init__.py b/examples/open_models/mimo/data/__init__.py similarity index 100% rename from examples/mimo/data/__init__.py rename to examples/open_models/mimo/data/__init__.py 
diff --git a/examples/mimo/data/avlm_sample_loader.py b/examples/open_models/mimo/data/avlm_sample_loader.py similarity index 100% rename from examples/mimo/data/avlm_sample_loader.py rename to examples/open_models/mimo/data/avlm_sample_loader.py diff --git a/examples/mimo/data/energon_avlm_task_encoder.py b/examples/open_models/mimo/data/energon_avlm_task_encoder.py similarity index 100% rename from examples/mimo/data/energon_avlm_task_encoder.py rename to examples/open_models/mimo/data/energon_avlm_task_encoder.py diff --git a/examples/mimo/data/energon_vlm_task_encoder.py b/examples/open_models/mimo/data/energon_vlm_task_encoder.py similarity index 100% rename from examples/mimo/data/energon_vlm_task_encoder.py rename to examples/open_models/mimo/data/energon_vlm_task_encoder.py diff --git a/examples/mimo/data/mock.py b/examples/open_models/mimo/data/mock.py similarity index 100% rename from examples/mimo/data/mock.py rename to examples/open_models/mimo/data/mock.py diff --git a/examples/mimo/data/prepare_video_llava_data.py b/examples/open_models/mimo/data/prepare_video_llava_data.py similarity index 100% rename from examples/mimo/data/prepare_video_llava_data.py rename to examples/open_models/mimo/data/prepare_video_llava_data.py diff --git a/examples/mimo/data/utils/calculate_audio_tokens.py b/examples/open_models/mimo/data/utils/calculate_audio_tokens.py similarity index 100% rename from examples/mimo/data/utils/calculate_audio_tokens.py rename to examples/open_models/mimo/data/utils/calculate_audio_tokens.py diff --git a/examples/mimo/model_providers/__init__.py b/examples/open_models/mimo/model_providers/__init__.py similarity index 100% rename from examples/mimo/model_providers/__init__.py rename to examples/open_models/mimo/model_providers/__init__.py diff --git a/examples/mimo/model_providers/hf_clip_encoder.py b/examples/open_models/mimo/model_providers/hf_clip_encoder.py similarity index 100% rename from 
examples/mimo/model_providers/hf_clip_encoder.py rename to examples/open_models/mimo/model_providers/hf_clip_encoder.py diff --git a/examples/mimo/model_providers/hf_whisper_encoder.py b/examples/open_models/mimo/model_providers/hf_whisper_encoder.py similarity index 100% rename from examples/mimo/model_providers/hf_whisper_encoder.py rename to examples/open_models/mimo/model_providers/hf_whisper_encoder.py diff --git a/examples/mimo/model_providers/llava_avlm.py b/examples/open_models/mimo/model_providers/llava_avlm.py similarity index 100% rename from examples/mimo/model_providers/llava_avlm.py rename to examples/open_models/mimo/model_providers/llava_avlm.py diff --git a/examples/mimo/model_providers/llava_vlm.py b/examples/open_models/mimo/model_providers/llava_vlm.py similarity index 100% rename from examples/mimo/model_providers/llava_vlm.py rename to examples/open_models/mimo/model_providers/llava_vlm.py diff --git a/examples/mimo/model_providers/mock.py b/examples/open_models/mimo/model_providers/mock.py similarity index 100% rename from examples/mimo/model_providers/mock.py rename to examples/open_models/mimo/model_providers/mock.py diff --git a/examples/mimo/scripts/run_avlm_train.sh b/examples/open_models/mimo/scripts/run_avlm_train.sh similarity index 94% rename from examples/mimo/scripts/run_avlm_train.sh rename to examples/open_models/mimo/scripts/run_avlm_train.sh index ced70cd0047..6b5cfbb359f 100755 --- a/examples/mimo/scripts/run_avlm_train.sh +++ b/examples/open_models/mimo/scripts/run_avlm_train.sh @@ -116,7 +116,7 @@ AUDIO_MODEL_ARGS=( if [ "$DEBUG_MODE" = true ]; then echo "Running in debug mode with $GPUS_PER_NODE GPU(s) per node..." 
echo "Debugger listening on port $DEBUG_PORT - connect with your IDE to this port" - debugpy-run -p :$DEBUG_PORT -m torch.distributed.run -- ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \ + debugpy-run -p :$DEBUG_PORT -m torch.distributed.run -- ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \ ${TRAINING_ARGS[@]} \ ${MODEL_PARALLEL_ARGS[@]} \ ${EVAL_AND_LOGGING_ARGS[@]} \ @@ -128,7 +128,7 @@ else echo "Running in normal mode with $GPUS_PER_NODE GPU(s) per node..." if [ "$DRY_RUN" = true ]; then echo "Dry run mode enabled" - echo "torchrun ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \ + echo "torchrun ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \ ${TRAINING_ARGS[@]} \ ${MODEL_PARALLEL_ARGS[@]} \ ${EVAL_AND_LOGGING_ARGS[@]} \ @@ -137,7 +137,7 @@ else ${AUDIO_MODEL_ARGS[@]} \ ${DATASET_ARGS[@]}" else - torchrun ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \ + torchrun ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \ ${TRAINING_ARGS[@]} \ ${MODEL_PARALLEL_ARGS[@]} \ ${EVAL_AND_LOGGING_ARGS[@]} \ diff --git a/examples/mimo/scripts/run_mock_train.sh b/examples/open_models/mimo/scripts/run_mock_train.sh similarity index 91% rename from examples/mimo/scripts/run_mock_train.sh rename to examples/open_models/mimo/scripts/run_mock_train.sh index 2ed71cd5ede..5970fe60767 100755 --- a/examples/mimo/scripts/run_mock_train.sh +++ b/examples/open_models/mimo/scripts/run_mock_train.sh @@ -1,7 +1,7 @@ #!/bin/bash # from the root of the repo -# ./examples/mimo/scripts/run_mock_train.sh +# ./examples/open_models/mimo/scripts/run_mock_train.sh export CUDA_DEVICE_MAX_CONNECTIONS=1 DRY_RUN=false @@ -82,7 +82,7 @@ GPT_MODEL_ARGS=( if [ "$DEBUG_MODE" = true ]; then echo "Running in debug mode with $GPUS_PER_NODE GPU(s) per node..." 
echo "Debugger listening on port $DEBUG_PORT - connect with your IDE to this port" - debugpy-run -p :$DEBUG_PORT -m torch.distributed.run -- ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \ + debugpy-run -p :$DEBUG_PORT -m torch.distributed.run -- ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \ ${TRAINING_ARGS[@]} \ ${MODEL_PARALLEL_ARGS[@]} \ ${EVAL_AND_LOGGING_ARGS[@]} \ @@ -92,14 +92,14 @@ else echo "Running in normal mode with $GPUS_PER_NODE GPU(s) per node..." if [ "$DRY_RUN" = true ]; then echo "Dry run mode enabled" - echo "torchrun ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \ + echo "torchrun ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \ ${TRAINING_ARGS[@]} \ ${MODEL_PARALLEL_ARGS[@]} \ ${EVAL_AND_LOGGING_ARGS[@]} \ ${TOKENIZER_ARGS[@]} \ ${GPT_MODEL_ARGS[@]}" else - torchrun ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \ + torchrun ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \ ${TRAINING_ARGS[@]} \ ${MODEL_PARALLEL_ARGS[@]} \ ${EVAL_AND_LOGGING_ARGS[@]} \ diff --git a/examples/mimo/scripts/run_video_vlm_train.sh b/examples/open_models/mimo/scripts/run_video_vlm_train.sh similarity index 94% rename from examples/mimo/scripts/run_video_vlm_train.sh rename to examples/open_models/mimo/scripts/run_video_vlm_train.sh index 2ec8af9d55f..ff63c1fc495 100755 --- a/examples/mimo/scripts/run_video_vlm_train.sh +++ b/examples/open_models/mimo/scripts/run_video_vlm_train.sh @@ -110,7 +110,7 @@ GPT_MODEL_ARGS=( if [ "$DEBUG_MODE" = true ]; then echo "Running in debug mode with $GPUS_PER_NODE GPU(s) per node..." 
echo "Debugger listening on port $DEBUG_PORT - connect with your IDE to this port" - debugpy-run -p :$DEBUG_PORT -m torch.distributed.run -- ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \ + debugpy-run -p :$DEBUG_PORT -m torch.distributed.run -- ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \ ${TRAINING_ARGS[@]} \ ${MODEL_PARALLEL_ARGS[@]} \ ${EVAL_AND_LOGGING_ARGS[@]} \ @@ -121,7 +121,7 @@ else echo "Running in normal mode with $GPUS_PER_NODE GPU(s) per node..." if [ "$DRY_RUN" = true ]; then echo "Dry run mode enabled" - echo "torchrun ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \ + echo "torchrun ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \ ${TRAINING_ARGS[@]} \ ${MODEL_PARALLEL_ARGS[@]} \ ${EVAL_AND_LOGGING_ARGS[@]} \ @@ -129,7 +129,7 @@ else ${GPT_MODEL_ARGS[@]} \ ${DATASET_ARGS[@]}" else - torchrun ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \ + torchrun ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \ ${TRAINING_ARGS[@]} \ ${MODEL_PARALLEL_ARGS[@]} \ ${EVAL_AND_LOGGING_ARGS[@]} \ diff --git a/examples/mimo/scripts/run_vlm_train.sh b/examples/open_models/mimo/scripts/run_vlm_train.sh similarity index 94% rename from examples/mimo/scripts/run_vlm_train.sh rename to examples/open_models/mimo/scripts/run_vlm_train.sh index 702b29795a4..bc95f9cc6c1 100755 --- a/examples/mimo/scripts/run_vlm_train.sh +++ b/examples/open_models/mimo/scripts/run_vlm_train.sh @@ -111,7 +111,7 @@ GPT_MODEL_ARGS=( if [ "$DEBUG_MODE" = true ]; then echo "Running in debug mode with $GPUS_PER_NODE GPU(s) per node..." 
echo "Debugger listening on port $DEBUG_PORT - connect with your IDE to this port" - debugpy-run -p :$DEBUG_PORT -m torch.distributed.run -- ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \ + debugpy-run -p :$DEBUG_PORT -m torch.distributed.run -- ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \ ${TRAINING_ARGS[@]} \ ${MODEL_PARALLEL_ARGS[@]} \ ${EVAL_AND_LOGGING_ARGS[@]} \ @@ -122,7 +122,7 @@ else echo "Running in normal mode with $GPUS_PER_NODE GPU(s) per node..." if [ "$DRY_RUN" = true ]; then echo "Dry run mode enabled" - echo "torchrun ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \ + echo "torchrun ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \ ${TRAINING_ARGS[@]} \ ${MODEL_PARALLEL_ARGS[@]} \ ${EVAL_AND_LOGGING_ARGS[@]} \ @@ -130,7 +130,7 @@ else ${GPT_MODEL_ARGS[@]} \ ${DATASET_ARGS[@]}" else - torchrun ${DISTRIBUTED_ARGS[@]} examples/mimo/train.py \ + torchrun ${DISTRIBUTED_ARGS[@]} examples/open_models/mimo/train.py \ ${TRAINING_ARGS[@]} \ ${MODEL_PARALLEL_ARGS[@]} \ ${EVAL_AND_LOGGING_ARGS[@]} \ diff --git a/examples/mimo/train.py b/examples/open_models/mimo/train.py similarity index 100% rename from examples/mimo/train.py rename to examples/open_models/mimo/train.py diff --git a/examples/mimo/utils/__init__.py b/examples/open_models/mimo/utils/__init__.py similarity index 100% rename from examples/mimo/utils/__init__.py rename to examples/open_models/mimo/utils/__init__.py diff --git a/examples/mimo/utils/data_helpers.py b/examples/open_models/mimo/utils/data_helpers.py similarity index 100% rename from examples/mimo/utils/data_helpers.py rename to examples/open_models/mimo/utils/data_helpers.py diff --git a/examples/mimo/utils/logging.py b/examples/open_models/mimo/utils/logging.py similarity index 100% rename from examples/mimo/utils/logging.py rename to examples/open_models/mimo/utils/logging.py diff --git a/examples/mimo/utils/model_helpers.py b/examples/open_models/mimo/utils/model_helpers.py similarity index 100% rename from 
examples/mimo/utils/model_helpers.py rename to examples/open_models/mimo/utils/model_helpers.py diff --git a/examples/t5/README.md b/examples/open_models/t5/README.md similarity index 93% rename from examples/t5/README.md rename to examples/open_models/t5/README.md index 205da1db370..8fcf7b96c59 100644 --- a/examples/t5/README.md +++ b/examples/open_models/t5/README.md @@ -21,7 +21,7 @@ DATA_PATH="" #_text_document srun -N $NUM_NODES --container-image $PYTORCH_IMAGE --container-mounts "/path/to/data:/path/to/data,/path/to/megatron-lm:/workspace/megatron-lm" --account $ACCOUNT -N 1 -J $JOB_NAME -p $PARTITION --no-container-mount-home -c " cd /workspace/megatron-lm - ./examples/t5/train_t5_220m_distributed.sh $CHECKPOINT_PATH $TENSORBOARD_LOGS_PATH $VOCAB_FILE $DATA_PATH" + ./examples/open_models/t5/train_t5_220m_distributed.sh $CHECKPOINT_PATH $TENSORBOARD_LOGS_PATH $VOCAB_FILE $DATA_PATH" ``` diff --git a/examples/t5/t5_mcore_train_curve.png b/examples/open_models/t5/t5_mcore_train_curve.png similarity index 100% rename from examples/t5/t5_mcore_train_curve.png rename to examples/open_models/t5/t5_mcore_train_curve.png diff --git a/examples/t5/train_t5_220m_distributed.sh b/examples/open_models/t5/train_t5_220m_distributed.sh similarity index 100% rename from examples/t5/train_t5_220m_distributed.sh rename to examples/open_models/t5/train_t5_220m_distributed.sh diff --git a/megatron/core/README.md b/megatron/core/README.md index a9134be41cd..4fae09d9909 100644 --- a/megatron/core/README.md +++ b/megatron/core/README.md @@ -40,7 +40,7 @@ torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py - **[Simple Training Loop](https://github.com/NVIDIA/Megatron-LM/blob/main/examples/run_simple_mcore_train_loop.py)** - Basic usage - **[Multimodal Training](https://github.com/NVIDIA/Megatron-LM/blob/main/examples/multimodal/)** - Vision-language models - **[Mixture-of-Experts](https://github.com/yanring/Megatron-MoE-ModelZoo)** - MoE examples -- **[Mamba 
Models](https://github.com/NVIDIA/Megatron-LM/blob/main/examples/mamba/)** - State-space models +- **[Mamba Models](https://github.com/NVIDIA/Megatron-LM/blob/main/examples/open_models/mamba/)** - State-space models **Documentation:** - **[📚 API Guide](https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/index.html)** - Complete API documentation diff --git a/tests/test_utils/recipes/h100/mimo.yaml b/tests/test_utils/recipes/h100/mimo.yaml index 88b17815ede..5614f690682 100644 --- a/tests/test_utils/recipes/h100/mimo.yaml +++ b/tests/test_utils/recipes/h100/mimo.yaml @@ -50,7 +50,7 @@ spec: "TENSORBOARD_PATH={assets_dir}/tensorboard" "CHECKPOINT_SAVE_PATH={artifacts_dir}/checkpoints" "CHECKPOINT_LOAD_PATH=/mnt/artifacts" - "TRAINING_SCRIPT_PATH=./examples/mimo/train.py" + "TRAINING_SCRIPT_PATH=./examples/open_models/mimo/train.py" "TRAINING_PARAMS_PATH=./tests/functional_tests/test_cases/{model}/{test_case}/model_config.yaml" "GOLDEN_VALUES_PATH=./tests/functional_tests/test_cases/{model}/{test_case}/golden_values_{environment}_{platforms}.json" "N_REPEAT={n_repeat}"