Description
Checklist
- I have searched existing issues, and this is a new bug report.
Bug Description
[INFO:swift] [rank3] model_parameter_info: Qwen3VLGPTModel: 6736.6098M Params (3068.9936M Trainable [45.5569%]), 1.9202M Buffers.
[INFO:swift] [rank0] model_parameter_info: Qwen3VLGPTModel: 6736.6098M Params (3068.9936M Trainable [45.5569%]), 1.9202M Buffers.
[INFO:swift] [rank2] model_parameter_info: Qwen3VLGPTModel: 6736.6098M Params (3068.9936M Trainable [45.5569%]), 1.9202M Buffers.
> number of parameters on (tensor, pipeline) model parallel rank (3, 0): 6736609841
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 6736609841
> number of parameters on (tensor, pipeline) model parallel rank (2, 0): 6736609841
[rank11]: Traceback (most recent call last):
[rank11]: File "/cpfs/user/gongzhenting/tools/anaconda3/envs/ms_swift/lib/python3.11/site-packages/swift/cli/_megatron/sft.py", line 7, in <module>
[rank11]: megatron_sft_main()
[rank11]: File "/cpfs/user/gongzhenting/tools/anaconda3/envs/ms_swift/lib/python3.11/site-packages/swift/megatron/train/sft.py", line 87, in megatron_sft_main
[rank11]: return MegatronSft(args).main()
[rank11]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]: File "/cpfs/user/gongzhenting/tools/anaconda3/envs/ms_swift/lib/python3.11/site-packages/swift/llm/base.py", line 49, in main
[rank11]: result = self.run()
[rank11]: ^^^^^^^^^^
[rank11]: File "/cpfs/user/gongzhenting/tools/anaconda3/envs/ms_swift/lib/python3.11/site-packages/swift/megatron/train/sft.py", line 77, in run
[rank11]: self.trainer.train(train_dataset, val_dataset, data_collator)
[rank11]: File "/cpfs/user/gongzhenting/tools/anaconda3/envs/ms_swift/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 1098, in train
[rank11]: pretrain(
[rank11]: File "/root/.cache/modelscope/hub/_github/Megatron-LM/megatron/training/training.py", line 666, in pretrain
[rank11]: model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
[rank11]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]: File "/cpfs/user/gongzhenting/tools/anaconda3/envs/ms_swift/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 498, in setup_model_and_optimizer
[rank11]: model, optimizer, opt_param_scheduler = self._origin_setup_model_and_optimizer(
[rank11]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]: File "/root/.cache/modelscope/hub/_github/Megatron-LM/megatron/training/training.py", line 1115, in setup_model_and_optimizer
[rank11]: opt_param_scheduler = get_optimizer_param_scheduler(optimizer)
[rank11]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]: File "/root/.cache/modelscope/hub/_github/Megatron-LM/megatron/training/training.py", line 1047, in get_optimizer_param_scheduler
[rank11]: raise Exception('either train-iters or train-samples should be provided.')
[rank11]: Exception: either train-iters or train-samples should be provided.
[rank14]: Traceback (most recent call last):
[rank14]: File "/cpfs/user/gongzhenting/tools/anaconda3/envs/ms_swift/lib/python3.11/site-packages/swift/cli/_megatron/sft.py", line 7, in <module>
[rank14]: megatron_sft_main()
[rank14]: File "/cpfs/user/gongzhenting/tools/anaconda3/envs/ms_swift/lib/python3.11/site-packages/swift/megatron/train/sft.py", line 87, in megatron_sft_main
[rank14]: return MegatronSft(args).main()
[rank14]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]: File "/cpfs/user/gongzhenting/tools/anaconda3/envs/ms_swift/lib/python3.11/site-packages/swift/llm/base.py", line 49, in main
[rank14]: result = self.run()
[rank14]: ^^^^^^^^^^
[rank14]: File "/cpfs/user/gongzhenting/tools/anaconda3/envs/ms_swift/lib/python3.11/site-packages/swift/megatron/train/sft.py", line 77, in run
[rank14]: self.trainer.train(train_dataset, val_dataset, data_collator)
[rank14]: File "/cpfs/user/gongzhenting/tools/anaconda3/envs/ms_swift/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 1098, in train
[rank14]: pretrain(
[rank14]: File "/root/.cache/modelscope/hub/_github/Megatron-LM/megatron/training/training.py", line 666, in pretrain
[rank14]: model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
[rank14]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]: File "/cpfs/user/gongzhenting/tools/anaconda3/envs/ms_swift/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 498, in setup_model_and_optimizer
[rank14]: model, optimizer, opt_param_scheduler = self._origin_setup_model_and_optimizer(
[rank14]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]: File "/root/.cache/modelscope/hub/_github/Megatron-LM/megatron/training/training.py", line 1115, in setup_model_and_optimizer
[rank14]: opt_param_scheduler = get_optimizer_param_scheduler(optimizer)
[rank14]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]: File "/root/.cache/modelscope/hub/_github/Megatron-LM/megatron/training/training.py", line 1047, in get_optimizer_param_scheduler
[rank14]: raise Exception('either train-iters or train-samples should be provided.')
[rank14]: Exception: either train-iters or train-samples should be provided.
...
Why does this happen, and how can I fix this error?
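For context, the raise comes from Megatron-LM's get_optimizer_param_scheduler (megatron/training/training.py, line 1047 in the traceback). Below is a minimal sketch of that guard, condensed from the traceback alone rather than copied from the source; the function name and signature are simplified here:

```python
# Condensed sketch of the guard behind the exception in
# megatron/training/training.py (get_optimizer_param_scheduler).
# Reconstructed from the traceback; not the verbatim Megatron-LM source.
def choose_scheduler_mode(train_iters, train_samples):
    if train_iters:       # iteration-based training
        return "iters"
    if train_samples:     # sample-based training
        return "samples"
    raise Exception('either train-iters or train-samples should be provided.')
```

On the failing ranks this is effectively choose_scheduler_mode(None, None): by the time the scheduler is built, neither train_iters nor train_samples is set in Megatron's parsed args, even though the YAML below sets train_iters.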
How to Reproduce
Hello, I submitted a 4-node DLC training job (8 H20 GPUs per node) on the Alibaba Cloud PAI platform. The training script is as follows:
# set -x
ROOTDIR=/cpfs/user/gongzhenting
export PATH="${ROOTDIR}/tools/anaconda3/bin:$PATH"
# >>> conda initialize >>>
conda_root=${ROOTDIR}/tools/anaconda3
__conda_setup="$('$conda_root/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
eval "$__conda_setup"
else
if [ -f "$conda_root/etc/profile.d/conda.sh" ]; then
. "$conda_root/etc/profile.d/conda.sh"
else
export PATH="$conda_root/bin:$PATH"
fi
fi
unset __conda_setup
# <<< conda initialize <<<
conda activate ms_swift
# Set up CUDA environment variables
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
nvcc --version # verify CUDA is available
# Base path configuration
WORKDIR=${ROOTDIR}/MLLM/Alem2.0/
cd ${WORKDIR}
# SwanLab experiment-tracking setup (syncs to a self-hosted SwanLab service)
export SWANLAB_LOG_DIR=$WORKDIR/logs/swanlab_logs
export SWANLAB_API_KEY="xxx"
export SWANLAB_API_URL="http://yyy:8000"
# Force-reset the local config and pin it to the self-hosted endpoint
swanlab login --api-key $SWANLAB_API_KEY --host $SWANLAB_API_URL --relogin
# Environment configuration
export USE_HF=1
export HF_ENDPOINT=https://hf-mirror.com
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# INFO floods the log with Channel/P2P messages; WARN keeps only warnings and errors
export NCCL_DEBUG=WARN
export NCCL_IB_DISABLE=0 # make sure InfiniBand stays enabled
export TQDM_INTERVAL=1 # force the progress bar to refresh
export TORCH_HOME=${ROOTDIR}/.cache/torch
export TRITON_CACHE_PATH=${ROOTDIR}/.cache/triton
# Model parameter configuration
export MAX_PIXELS=1003520
export VIDEO_MAX_PIXELS=50176
export FPS_MAX_FRAMES=12
export SWIFT_PATCH_CONV3D=1
# Training-related path configuration
# Use a local Megatron-LM so Swift does not run git fetch against the ModelScope cache
export MEGATRON_LM_PATH=${ROOTDIR}/tools/training-tools/Megatron-LM
export CONFIG=${CONFIG:=configs/swift/examples/sft/mydataset_32gpus_lr1e-4.yaml}
export dataset_name=mydataset
export modelname=Qwen3-Omni-30B-A3B-Instruct
export sft_type=full
export logdir=${WORKDIR}/logs/${modelname}_sft_${sft_type}/${dataset_name}
mkdir -p ${logdir}
LOGFILE=${logdir}/$(date +%Y-%m-%d_%H-%M-%S).log
# --- Node parameter settings ---
# # --- Local single-node 8-GPU settings ---
# export NNODES=1
# export NODE_RANK=0
# export MASTER_ADDR=localhost
# export MASTER_PORT=60001
export NPROC_PER_NODE=8
# Launch training
NNODES=${WORLD_SIZE} \
NODE_RANK=${RANK} \
MASTER_ADDR=${MASTER_ADDR} \
MASTER_PORT=${MASTER_PORT} \
NPROC_PER_NODE=${NPROC_PER_NODE} \
megatron sft --config $CONFIG \
--report_to swanlab \
--wandb_project Qwen3-Omni_sft_${dataset_name} \
--wandb_save_dir ${WORKDIR}/logs/swanlab_logs \
2>&1 | tee ${LOGFILE}
echo "Training finished. Log: ${LOGFILE}"
The contents of configs/swift/examples/sft/mydataset_32gpus_lr1e-4.yaml are as follows:
### model
model: /zzz/hf_models/Qwen3-Omni-30B-A3B-Instruct
attn_impl: flash_attn
load_safetensors: true
save_safetensors: true
# Parallel strategy: each model replica spans 8 GPUs (one node)
# 32 GPUs total / (4*2) = 4 data-parallel groups (DP=4)
tensor_model_parallel_size: 4
pipeline_model_parallel_size: 2
sequence_parallel: true
context_parallel_size: 1
expert_model_parallel_size: 4 # raise EP to improve MoE compute efficiency
### method
train_type: full
freeze_llm: false
freeze_vit: false
freeze_aligner: true
# Same as the 8-GPU config: aggressive recomputation to save memory and avoid OOM when the optimizer allocates
recompute_granularity: full
recompute_method: uniform
recompute_num_layers: 4
cross_entropy_loss_fusion: true
### dataset
dataset:
- mydataset
custom_dataset_info: configs/swift/llm/dataset/data/custom_dataset_info.json
num_workers: 16
dataset_num_proc: 64
max_length: 16384
packing: true
### output
save: trained_model/Qwen3-Omni-30B-A3B-Instruct_sft_full/mydataset
save_interval: 200
no_save_optim: true
no_save_rng: true
### train
# Under multi-node DLC, train_iters must be set explicitly, otherwise get_optimizer_param_scheduler raises; when max_epochs is set, Swift overrides it from the dataset length (see the sanity-check sketch after this config)
train_iters: 5000
max_epochs: 2
padding_free: false
bf16: true
# Tuning note: with 96 GB per H20, try MBS=2 and fall back to 1 on OOM
micro_batch_size: 2
# Math: GBS = DP(4) * MBS(2) * grad-accum(32) = 256
global_batch_size: 256
gradient_checkpointing: true
vit_gradient_checkpointing: true
# The ViT's gradient_checkpointing_enable() does not support use_reentrant; leave this empty to avoid errors
gradient_checkpointing_kwargs: {}
# Optimizer memory savings: CPU offload must be enabled together with use_precision_aware_optimizer
use_precision_aware_optimizer: true
optimizer_cpu_offload: true
optimizer_offload_fraction: 1.0
# main_grads_dtype: bf16
# exp_avg_dtype: bf16
# exp_avg_sq_dtype: bf16
lr: 1e-4
lr_warmup_fraction: 0.05
min_lr: 1e-6
finetune: true
### logging (SwanLab; the script passes in wandb_project / wandb_save_dir)
report_to: swanlab
### eval
eval_interval: 200
split_dataset_ratio: 0.01
### moe
moe_permute_fusion: true
moe_grouped_gemm: true
moe_shared_expert_overlap: true
moe_aux_loss_coeff: 1e-6
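The parallelism and batch arithmetic in the comments above can be verified mechanically. The sketch below checks them, and also illustrates one plausible way train_iters could be derived from max_epochs and dataset length, as the train_iters comment describes; the iters_from_epochs formula and the sample count are assumptions for illustration, not Swift's confirmed internals:

```python
# Sanity-check the parallel topology and batch math from the YAML comments.
import math

world_size = 32        # 4 nodes x 8 H20 GPUs
tp, pp, cp = 4, 2, 1   # tensor / pipeline / context parallel sizes from the YAML
dp = world_size // (tp * pp * cp)
print(dp)              # 4 -> matches the "DP=4" comment

mbs, gbs = 2, 256      # micro_batch_size, global_batch_size
grad_accum = gbs // (dp * mbs)
print(grad_accum)      # 32 -> matches "GBS = DP(4) * MBS(2) * grad-accum(32) = 256"

# Hypothetical illustration only: if Swift recomputes train_iters from
# max_epochs and the dataset length (as the comment above claims), the
# arithmetic would look roughly like this. The sample count is made up.
def iters_from_epochs(num_train_samples: int, max_epochs: int, gbs: int) -> int:
    return math.ceil(num_train_samples * max_epochs / gbs)

print(iters_from_epochs(num_train_samples=100_000, max_epochs=2, gbs=256))  # 782
```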
As you can see, I explicitly set train_iters: 5000 in the YAML config, yet once the DLC job starts, it still fails with exactly the same traceback shown in the Bug Description above (Exception: either train-iters or train-samples should be provided.). Why does this happen, and how can I fix it?
Additional Information
No response