Checklist
- I have searched existing issues, and this is a new bug report.
Bug Description
GRPO training with the Megatron backend runs to completion and the merged safetensors checkpoint is saved, but the job then segfaults during process exit, immediately after the NCCL `destroy_process_group()` warnings:
```
[2026-02-13 23:08:04] iteration 2/2 | consumed samples: 128 | elapsed time per iteration (ms): 36552.1 | memory(GiB): 52.07 | elapsed time: 3m 14s | remaining time: 0s | learning rate: 1.000000E-05 | global batch size: 64 | loss: 0.000000E+00 | reward: -6.945313E-01 | reward_std: 3.756505E-02 | frac_reward_zero_std: 9.375000E-01 | rewards/Reward/mean: -6.945312E-01 | rewards/Reward/std: 6.552404E-01 | clip_ratio/low_mean: 0.000000E+00 | clip_ratio/high_mean: 0.000000E+00 | clip_ratio/region_mean: 0.000000E+00 | completions/mean_length: 7.893281E+02 | completions/max_length: 1.117562E+03 | completions/min_length: 4.672500E+02 | clip_ratio/low_min: 0.000000E+00 | clip_ratio/high_max: 0.000000E+00 | load_balancing_loss: 1.735384E+00 | loss scale: 1.0 | grad norm: 0.064 | number of skipped iterations: 0 | number of nan iterations: 0 |
[after training is done] datetime: 2026-02-13 23:08:04
saving checkpoint at iteration 2 to qwen3_vl_30b_a3b_instruct_grpo_v3/v8-20260213-225959/checkpoint-2 in torch_dist format
Storing distributed optimizer sharded state of type fully_sharded_model_space
  successfully saved checkpoint from iteration 2 to qwen3_vl_30b_a3b_instruct_grpo_v3/v8-20260213-225959/checkpoint-2 [ t 1/4, p 1/1 ]
[INFO:swift] Successfully saved safetensors model weights in `qwen3_vl_30b_a3b_instruct_grpo_v3/v8-20260213-225959/checkpoint-2-merged`.
[INFO:swift] End time of running main: 2026-02-13 23:13:54.162693
[rank4]:[W213 23:14:00.507426805 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank5]:[W213 23:14:00.565532420 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank2]:[W213 23:14:00.114979707 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank0]:[W213 23:14:00.177886723 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank3]:[W213 23:14:01.336360975 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank1]:[W213 23:14:01.356616125 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
!!!!!!! Segfault encountered !!!!!!!
!!!!!!! Segfault encountered !!!!!!!
```
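The warning itself points at the PyTorch shutdown docs it links: each rank should call `torch.distributed.destroy_process_group()` before exiting. As a point of reference only (a generic sketch of the documented pattern, not ms-swift's actual teardown code):

```python
import atexit

import torch.distributed as dist


def _cleanup_process_group():
    # Tear down NCCL communicators explicitly, as the linked shutdown docs
    # recommend, instead of relying on interpreter exit to do it.
    if dist.is_initialized():
        dist.destroy_process_group()


# atexit covers early returns and sys.exit() paths in the training script.
atexit.register(_cleanup_process_group)
```

Note the segfault is printed after `[INFO:swift] End time of running main`, so the crash appears to happen during interpreter/NCCL teardown rather than in the training loop itself.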
How to Reproduce
```bash
#!/bin/bash
MEGATRON_LM_PATH= \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
megatron rlhf \
    --rlhf_type grpo \
    --load "${LOAD_PATH}" \
    --dataset "${DATASET_PATH}" \
    --save "${SAVE_PATH}" \
    --load_safetensors false \
    --save_safetensors true \
    --merge_lora true \
    --split_dataset_ratio 0 \
    --moe_permute_fusion true \
    --tensor_model_parallel_size 4 \
    --expert_tensor_parallel_size 1 \
    --expert_model_parallel_size 4 \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 1e-3 \
    --max_epochs 1 \
    --global_batch_size 64 \
    --micro_batch_size 2 \
    --steps_per_generation 2 \
    --num_generations 8 \
    --external_plugins "reward_func_v3.py" \
    --reward_funcs external_me \
    --use_vllm true \
    --vllm_mode colocate \
    --vllm_gpu_memory_utilization 0.3 \
    --vllm_tensor_parallel_size 4 \
    --vllm_max_model_len 16384 \
    --max_length 8192 \
    --max_completion_length 8192 \
    --train_type lora \
    --lora_rank 128 \
    --lora_alpha 256 \
    --target_modules all-linear \
    --freeze_vit true \
    --lr 5e-5 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-5 \
    --bf16 true \
    --save_interval 200 \
    --beta 0.00 \
    --importance_sampling_level sequence \
    --epsilon 3e-4 \
    --epsilon_high 4e-4 \
    --dynamic_sample false \
    --overlong_filter true \
    --loss_type grpo \
    --sleep_level 2 \
    --offload_model true \
    --offload_bridge false \
    --offload_optimizer true \
    --log_interval 1 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --finetune true \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --attention_backend flash \
    --temperature 1.0 \
    --padding_free true \
    --sequence_parallel true \
    --log_completions true \
    --tensorboard_dir "${LOG_PATH}/tensorboard" \
    2>&1 | tee "${LOG_PATH}/training_$(date +%Y%m%d_%H%M%S).log"
```
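For context, `--reward_funcs external_me` resolves to a reward registered by the `--external_plugins` file `reward_func_v3.py`, which is not included in this report. A minimal placeholder in the shape of ms-swift's GRPO plugin examples (the constant score is purely illustrative, not the real reward logic):

```python
# reward_func_v3.py -- placeholder only; the actual reward logic is not
# part of this report.
from swift.plugin import ORM, orms


class MyReward(ORM):
    def __call__(self, completions, **kwargs):
        # Return one float per completion; 0.0 stands in for real scoring.
        return [0.0 for _ in completions]


# The registration key must match the name passed to --reward_funcs.
orms['external_me'] = MyReward
```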
Additional Information
No response