
Qwen3-VL-30B-A3B-Instruct: model saving fails during GRPO LoRA training #8049

@lizechng

Description


Checklist

  • I have searched existing issues, and this is a new bug report.

Bug Description

```
[2026-02-13 23:08:04] iteration 2/ 2 | consumed samples: 128 | elapsed time per iteration (ms): 36552.1 | memory(GiB): 52.07 | elapsed time: 3m 14s | remaining time: 0s | learning rate: 1.000000E-05 | global batch size: 64 | loss: 0.000000E+00 | reward: -6.945313E-01 | reward_std: 3.756505E-02 | frac_reward_zero_std: 9.375000E-01 | rewards/Reward/mean: -6.945312E-01 | rewards/Reward/std: 6.552404E-01 | clip_ratio/low_mean: 0.000000E+00 | clip_ratio/high_mean: 0.000000E+00 | clip_ratio/region_mean: 0.000000E+00 | completions/mean_length: 7.893281E+02 | completions/max_length: 1.117562E+03 | completions/min_length: 4.672500E+02 | clip_ratio/low_min: 0.000000E+00 | clip_ratio/high_max: 0.000000E+00 | load_balancing_loss: 1.735384E+00 | loss scale: 1.0 | grad norm: 0.064 | number of skipped iterations: 0 | number of nan iterations: 0 |
[after training is done] datetime: 2026-02-13 23:08:04
saving checkpoint at iteration 2 to qwen3_vl_30b_a3b_instruct_grpo_v3/v8-20260213-225959/checkpoint-2 in torch_dist format
Storing distributed optimizer sharded state of type fully_sharded_model_space
successfully saved checkpoint from iteration 2 to qwen3_vl_30b_a3b_instruct_grpo_v3/v8-20260213-225959/checkpoint-2
[ t 1/4, p 1/1 ] [INFO:swift] Successfully saved safetensors model weights in `qwen3_vl_30b_a3b_instruct_grpo_v3/v8-20260213-225959/checkpoint-2-merged`.
[INFO:swift] End time of running main: 2026-02-13 23:13:54.162693
[rank4]:[W213 23:14:00.507426805 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank5]:[W213 23:14:00.565532420 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank2]:[W213 23:14:00.114979707 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank0]:[W213 23:14:00.177886723 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank3]:[W213 23:14:01.336360975 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank1]:[W213 23:14:01.356616125 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
!!!!!!! Segfault encountered !!!!!!!

!!!!!!! Segfault encountered !!!!!!!
```
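The NCCL warnings above say `destroy_process_group()` was never called before the program exited, which precedes the segfault. A minimal sketch of the explicit teardown the warning asks for is shown below; it uses a single-process CPU `gloo` group purely for illustration (the real run uses NCCL across 8 ranks, and the teardown inside ms-swift/Megatron may differ):

```python
import os
import torch.distributed as dist

def main():
    # Single-process "gloo" group for illustration; real training uses NCCL.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29511")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)
    try:
        pass  # training and checkpoint saving would happen here
    finally:
        # Explicit teardown before exit; skipping this is exactly what the
        # ProcessGroupNCCL warning in the log complains about.
        if dist.is_initialized():
            dist.destroy_process_group()

main()
```

Whether the missing teardown is actually what triggers the segfault here, or the crash happens earlier during the merged-LoRA save path, is not clear from the log alone.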

How to Reproduce

```shell
#!/bin/bash

MEGATRON_LM_PATH= \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
megatron rlhf \
    --rlhf_type grpo \
    --load "${LOAD_PATH}" \
    --dataset "${DATASET_PATH}" \
    --save "${SAVE_PATH}" \
    --load_safetensors false \
    --save_safetensors true \
    --merge_lora true \
    --split_dataset_ratio 0 \
    --moe_permute_fusion true \
    --tensor_model_parallel_size 4 \
    --expert_tensor_parallel_size 1 \
    --expert_model_parallel_size 4 \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 1e-3 \
    --max_epochs 1 \
    --global_batch_size 64 \
    --micro_batch_size 2 \
    --steps_per_generation 2 \
    --num_generations 8 \
    --external_plugins "reward_func_v3.py" \
    --reward_funcs external_me \
    --use_vllm true \
    --vllm_mode colocate \
    --vllm_gpu_memory_utilization 0.3 \
    --vllm_tensor_parallel_size 4 \
    --vllm_max_model_len 16384 \
    --max_length 8192 \
    --max_completion_length 8192 \
    --train_type lora \
    --lora_rank 128 \
    --lora_alpha 256 \
    --target_modules all-linear \
    --freeze_vit true \
    --lr 5e-5 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-5 \
    --bf16 true \
    --save_interval 200 \
    --beta 0.00 \
    --importance_sampling_level sequence \
    --epsilon 3e-4 \
    --epsilon_high 4e-4 \
    --dynamic_sample false \
    --overlong_filter true \
    --loss_type grpo \
    --sleep_level 2 \
    --offload_model true \
    --offload_bridge false \
    --offload_optimizer true \
    --log_interval 1 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --finetune true \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --attention_backend flash \
    --temperature 1.0 \
    --padding_free true \
    --sequence_parallel true \
    --log_completions true \
    --tensorboard_dir "${LOG_PATH}/tensorboard" \
    2>&1 | tee "${LOG_PATH}/training_$(date +%Y%m%d_%H%M%S).log"
```

Additional Information

No response
