Labels
bug: Something isn't working
Description
Checklist
- I have searched existing issues, and this is a new bug report.
Bug Description
When training Qwen3-VL, ZeRO-1 works normally, but ZeRO-3 fails with: AttributeError: 'Linear' object has no attribute 'ds_grads_remaining'
How to Reproduce
Docker: modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.8.1-py311-torch2.9.0-vllm0.13.0-modelscope1.33.0-swift3.12.3
I used the ms-swift Docker image directly as the environment and did not change any packages with pip install.
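For anyone trying to reproduce this, the following is a minimal sketch for recording the package versions actually present inside the container. The list of distribution names is an assumption about which packages are relevant here, not something taken from the image manifest. The helper `collect_versions` is a hypothetical name used only for illustration.

```python
from importlib import metadata

# Distribution names assumed to be relevant to this report (not taken from
# the image manifest); adjust the list as needed.
PACKAGES = ["ms-swift", "deepspeed", "torch", "transformers", "peft", "accelerate"]


def collect_versions(packages=PACKAGES):
    """Return a dict mapping each distribution name to its installed version.

    Distributions that are not installed map to the string "not installed".
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    # Print one line per package so the output can be pasted into the issue.
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```

Running it inside the container and pasting the output into the issue would make it easier to compare the ZeRO-1 and ZeRO-3 runs against the same package set.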
The training command is as follows:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
FPS_MAX_FRAMES=160 \
FPS=4 \
swift sft \
--model /yke/models/Qwen3-VL-32B-Instruct \
--train_type lora \
--dataset train_with_prompt.jsonl \
--torch_dtype bfloat16 \
--dataset_num_proc 8 \
--dataloader_num_workers 4 \
--dataset_shuffle true \
--split_dataset_ratio 0.1 \
--seed 42 \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--warmup_ratio 0.1 \
--max_grad_norm 1 \
--weight_decay 0.01 \
--lr_scheduler_type cosine \
--learning_rate 1e-4 \
--target_modules all-linear \
--lora_rank 128 \
--lora_alpha 256 \
--lora_dropout 0.0 \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 5 \
--logging_steps 5 \
--max_length 40960 \
--output_dir output \
--attn_impl flash_attn \
--use_liger_kernel false \
--gradient_checkpointing true \
--gradient_checkpointing_kwargs '{"use_reentrant": false}' \
--num_labels 1 \
--problem_type regression \
--task_type seq_cls \
--use_chat_template true \
--deepspeed zero3
Additional Information
No response