
Exception: either train-iters or train-samples should be provided. #8052

@IamMegatron2025

Description

Checklist

  • I have searched existing issues, and this is a new bug report.

Bug Description

[INFO:swift] [rank3] model_parameter_info: Qwen3VLGPTModel: 6736.6098M Params (3068.9936M Trainable [45.5569%]), 1.9202M Buffers.
[INFO:swift] [rank0] model_parameter_info: Qwen3VLGPTModel: 6736.6098M Params (3068.9936M Trainable [45.5569%]), 1.9202M Buffers.
[INFO:swift] [rank2] model_parameter_info: Qwen3VLGPTModel: 6736.6098M Params (3068.9936M Trainable [45.5569%]), 1.9202M Buffers.
 > number of parameters on (tensor, pipeline) model parallel rank (3, 0): 6736609841
 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 6736609841
 > number of parameters on (tensor, pipeline) model parallel rank (2, 0): 6736609841
[rank11]: Traceback (most recent call last):
[rank11]:   File "/cpfs/user/gongzhenting/tools/anaconda3/envs/ms_swift/lib/python3.11/site-packages/swift/cli/_megatron/sft.py", line 7, in <module>
[rank11]:     megatron_sft_main()
[rank11]:   File "/cpfs/user/gongzhenting/tools/anaconda3/envs/ms_swift/lib/python3.11/site-packages/swift/megatron/train/sft.py", line 87, in megatron_sft_main
[rank11]:     return MegatronSft(args).main()
[rank11]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]:   File "/cpfs/user/gongzhenting/tools/anaconda3/envs/ms_swift/lib/python3.11/site-packages/swift/llm/base.py", line 49, in main
[rank11]:     result = self.run()
[rank11]:              ^^^^^^^^^^
[rank11]:   File "/cpfs/user/gongzhenting/tools/anaconda3/envs/ms_swift/lib/python3.11/site-packages/swift/megatron/train/sft.py", line 77, in run
[rank11]:     self.trainer.train(train_dataset, val_dataset, data_collator)
[rank11]:   File "/cpfs/user/gongzhenting/tools/anaconda3/envs/ms_swift/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 1098, in train
[rank11]:     pretrain(
[rank11]:   File "/root/.cache/modelscope/hub/_github/Megatron-LM/megatron/training/training.py", line 666, in pretrain
[rank11]:     model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
[rank11]:                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]:   File "/cpfs/user/gongzhenting/tools/anaconda3/envs/ms_swift/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 498, in setup_model_and_optimizer
[rank11]:     model, optimizer, opt_param_scheduler = self._origin_setup_model_and_optimizer(
[rank11]:                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]:   File "/root/.cache/modelscope/hub/_github/Megatron-LM/megatron/training/training.py", line 1115, in setup_model_and_optimizer
[rank11]:     opt_param_scheduler = get_optimizer_param_scheduler(optimizer)
[rank11]:                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]:   File "/root/.cache/modelscope/hub/_github/Megatron-LM/megatron/training/training.py", line 1047, in get_optimizer_param_scheduler
[rank11]:     raise Exception('either train-iters or train-samples should be provided.')
[rank11]: Exception: either train-iters or train-samples should be provided.
[rank14]: Traceback (most recent call last):
[rank14]:   File "/cpfs/user/gongzhenting/tools/anaconda3/envs/ms_swift/lib/python3.11/site-packages/swift/cli/_megatron/sft.py", line 7, in <module>
[rank14]:     megatron_sft_main()
[rank14]:   File "/cpfs/user/gongzhenting/tools/anaconda3/envs/ms_swift/lib/python3.11/site-packages/swift/megatron/train/sft.py", line 87, in megatron_sft_main
[rank14]:     return MegatronSft(args).main()
[rank14]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]:   File "/cpfs/user/gongzhenting/tools/anaconda3/envs/ms_swift/lib/python3.11/site-packages/swift/llm/base.py", line 49, in main
[rank14]:     result = self.run()
[rank14]:              ^^^^^^^^^^
[rank14]:   File "/cpfs/user/gongzhenting/tools/anaconda3/envs/ms_swift/lib/python3.11/site-packages/swift/megatron/train/sft.py", line 77, in run
[rank14]:     self.trainer.train(train_dataset, val_dataset, data_collator)
[rank14]:   File "/cpfs/user/gongzhenting/tools/anaconda3/envs/ms_swift/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 1098, in train
[rank14]:     pretrain(
[rank14]:   File "/root/.cache/modelscope/hub/_github/Megatron-LM/megatron/training/training.py", line 666, in pretrain
[rank14]:     model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
[rank14]:                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]:   File "/cpfs/user/gongzhenting/tools/anaconda3/envs/ms_swift/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 498, in setup_model_and_optimizer
[rank14]:     model, optimizer, opt_param_scheduler = self._origin_setup_model_and_optimizer(
[rank14]:                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]:   File "/root/.cache/modelscope/hub/_github/Megatron-LM/megatron/training/training.py", line 1115, in setup_model_and_optimizer
[rank14]:     opt_param_scheduler = get_optimizer_param_scheduler(optimizer)
[rank14]:                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]:   File "/root/.cache/modelscope/hub/_github/Megatron-LM/megatron/training/training.py", line 1047, in get_optimizer_param_scheduler
[rank14]:     raise Exception('either train-iters or train-samples should be provided.')
[rank14]: Exception: either train-iters or train-samples should be provided.
...

Why does this happen? How can I fix this error?
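
For reference, the exception comes from get_optimizer_param_scheduler in Megatron-LM's megatron/training/training.py. Below is a paraphrased sketch of the branch that raises, reconstructed from the traceback above; the exact code depends on the Megatron-LM revision:

# Paraphrased sketch of get_optimizer_param_scheduler (megatron/training/training.py).
# get_args() is Megatron's accessor for the fully parsed global argument namespace.
def get_optimizer_param_scheduler(optimizer):
    args = get_args()
    if args.train_iters:
        ...  # iteration-based run: warmup/decay steps are derived from train_iters
    elif args.train_samples:
        ...  # sample-based run: warmup/decay steps are derived from train_samples
    else:
        raise Exception('either train-iters or train-samples should be provided.')

So on the failing ranks, neither args.train_iters nor args.train_samples is set by the time the scheduler is built, even though the config shown below specifies train_iters: 5000.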

How to Reproduce

Hello, I submitted a 4-node DLC training job (8 H20 GPUs per node) on the Alibaba Cloud PAI platform. The training script is as follows:

# set -x
ROOTDIR=/cpfs/user/gongzhenting
export PATH="${ROOTDIR}/tools/anaconda3/bin:$PATH"

# >>> conda initialize >>>
conda_root=${ROOTDIR}/tools/anaconda3
__conda_setup="$('$conda_root/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "$conda_root/etc/profile.d/conda.sh" ]; then
        . "$conda_root/etc/profile.d/conda.sh"
    else
        export PATH="$conda_root/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<
conda activate ms_swift

# Add CUDA environment variables
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
nvcc --version  # verify that CUDA is available

# Base path configuration
WORKDIR=${ROOTDIR}/MLLM/Alem2.0/
cd ${WORKDIR}

# SwanLab experiment-tracking setup (syncs to a self-hosted SwanLab service)
export SWANLAB_LOG_DIR=$WORKDIR/logs/swanlab_logs
export SWANLAB_API_KEY="xxx"
export SWANLAB_API_URL="http://yyy:8000"
# Force-reset the local login and pin it to the self-hosted endpoint
swanlab login --api-key $SWANLAB_API_KEY --host $SWANLAB_API_URL --relogin

# Environment configuration
export USE_HF=1
export HF_ENDPOINT=https://hf-mirror.com
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# INFO floods the log with Channel/P2P messages; WARN keeps only warnings and errors
export NCCL_DEBUG=WARN
export NCCL_IB_DISABLE=0 # keep InfiniBand enabled
export TQDM_INTERVAL=1  # force progress-bar refresh
export TORCH_HOME=${ROOTDIR}/.cache/torch
export TRITON_CACHE_PATH=${ROOTDIR}/.cache/triton

# Model parameters
export MAX_PIXELS=1003520
export VIDEO_MAX_PIXELS=50176
export FPS_MAX_FRAMES=12
export SWIFT_PATCH_CONV3D=1

# Training-related paths
# Use a local Megatron-LM checkout so Swift does not run git fetch on the ModelScope cache
export MEGATRON_LM_PATH=${ROOTDIR}/tools/training-tools/Megatron-LM
export CONFIG=${CONFIG:=configs/swift/examples/sft/mydataset_32gpus_lr1e-4.yaml}
export dataset_name=mydataset
export modelname=Qwen3-Omni-30B-A3B-Instruct
export sft_type=full
export logdir=${WORKDIR}/logs/${modelname}_sft_${sft_type}/${dataset_name}
mkdir -p ${logdir}
LOGFILE=${logdir}/$(date +%Y-%m-%d_%H-%M-%S).log

# --- Node settings ---
# # --- Local single-node 8-GPU settings ---
# export NNODES=1
# export NODE_RANK=0
# export MASTER_ADDR=localhost
# export MASTER_PORT=60001
export NPROC_PER_NODE=8

# Run training

NNODES=${WORLD_SIZE} \
NODE_RANK=${RANK} \
MASTER_ADDR=${MASTER_ADDR} \
MASTER_PORT=${MASTER_PORT}  \
NPROC_PER_NODE=${NPROC_PER_NODE} \
megatron sft --config $CONFIG \
    --report_to swanlab \
    --wandb_project Qwen3-Omni_sft_${dataset_name} \
    --wandb_save_dir ${WORKDIR}/logs/swanlab_logs
echo "Training started in background. Log: ${LOGFILE}  PID: $!"

The contents of configs/swift/examples/sft/mydataset_32gpus_lr1e-4.yaml are as follows:

### model
model: /zzz/hf_models/Qwen3-Omni-30B-A3B-Instruct
attn_impl: flash_attn
load_safetensors: true
save_safetensors: true
# Parallelism: each model replica spans 8 GPUs (one node)
# 32 GPUs total / (TP=4 * PP=2) = 4 data-parallel groups (DP=4)
tensor_model_parallel_size: 4
pipeline_model_parallel_size: 2
sequence_parallel: true
context_parallel_size: 1
expert_model_parallel_size: 4  # raise EP to improve MoE compute efficiency

### method
train_type: full
freeze_llm: false
freeze_vit: false
freeze_aligner: true

# Same as the 8-GPU config: aggressive recomputation to save memory and avoid OOM when the optimizer allocates
recompute_granularity: full
recompute_method: uniform
recompute_num_layers: 4
cross_entropy_loss_fusion: true

### dataset
dataset:
  - mydataset

custom_dataset_info: configs/swift/llm/dataset/data/custom_dataset_info.json
num_workers: 16
dataset_num_proc: 64
max_length: 16384
packing: true

### output
save: trained_model/Qwen3-Omni-30B-A3B-Instruct_sft_full/mydataset
save_interval: 200
no_save_optim: true
no_save_rng: true

### train
# train_iters must be set explicitly under multi-node DLC, otherwise get_optimizer_param_scheduler raises; when max_epochs is set, Swift overrides train_iters based on the dataset length
train_iters: 5000
max_epochs: 2
padding_free: false
bf16: true
# Tuning note: with 96 GB per H20, try MBS=2 and fall back to 1 on OOM
micro_batch_size: 2
# GBS = DP(4) * MBS(2) * gradient accumulation steps(32) = 256
global_batch_size: 256
gradient_checkpointing: true
vit_gradient_checkpointing: true
# The ViT's gradient_checkpointing_enable() does not accept use_reentrant; leave this empty to avoid an error
gradient_checkpointing_kwargs: {}
# Optimizer memory savings: CPU offload must be enabled together with use_precision_aware_optimizer
use_precision_aware_optimizer: true
optimizer_cpu_offload: true
optimizer_offload_fraction: 1.0
# main_grads_dtype: bf16
# exp_avg_dtype: bf16
# exp_avg_sq_dtype: bf16

lr: 1e-4
lr_warmup_fraction: 0.05
min_lr: 1e-6
finetune: true

### logging (SwanLab; the launch script passes wandb_project / wandb_save_dir)
report_to: swanlab

### eval
eval_interval: 200
split_dataset_ratio: 0.01

### moe
moe_permute_fusion: true
moe_grouped_gemm: true
moe_shared_expert_overlap: true
moe_aux_loss_coeff: 1e-6
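
As a quick sanity check that the file parses and the key is present as written, here is a minimal PyYAML sketch (the path is the one from the launch script above):

import yaml

# Load the training config and print the scheduler-related keys.
with open("configs/swift/examples/sft/mydataset_32gpus_lr1e-4.yaml") as f:
    cfg = yaml.safe_load(f)
print("train_iters =", cfg.get("train_iters"))  # expected: 5000
print("max_epochs  =", cfg.get("max_epochs"))   # expected: 2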

As shown above, my yaml explicitly sets train_iters: 5000, yet once the DLC job starts training it still fails with exactly the same traceback as in the Bug Description.

Why does this happen, and how can I fix it?

Additional Information

No response
