GRPO training times out: rollout server starts successfully, but GRPO training fails with XXX:51216 timeout #8053

@alanshaoTT

Checklist

  • I have searched existing issues, and this is a new bug report.

Bug Description

After the rollout server has started, launching GRPO training reports that it cannot reach port 51216 on the vLLM server IP, and the connection times out.
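To narrow down where the timeout comes from, a quick reachability check from the training node may help. The following is a minimal sketch, not part of ms-swift: `probe` is a hypothetical helper that only classifies a plain TCP connection attempt. It assumes (unconfirmed by this report) that 51216 is the port set through VLLM_SERVER_GROUP_PORT in the rollout script, and that this port may only accept connections while the trainer is setting up its communication channel with the server, so a "refused" result outside that window is not conclusive.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt to host:port.

    Returns "open", "refused", "timeout", or "error: <details>".
    Hypothetical diagnostic helper; it does not call any ms-swift or vLLM API.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on this port.
        return "refused"
    except socket.timeout:
        # No answer at all: often a firewall, security group, or routing issue.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. name resolution failure or unreachable network.
        return f"error: {exc}"
```

Run on the training node, `probe("10.34.79.116", 8000)` and `probe("10.34.79.116", 51216)` use the host and ports from the scripts in this report. If port 8000 is "open" while 51216 reports "timeout", one plausible (but unverified) explanation is that only the HTTP port is allowed between the two nodes.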

How to Reproduce

The rollout script is:

# Node A
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

export VLLM_SERVER_GROUP_HOST=0.0.0.0
export VLLM_SERVER_GROUP_PORT=51216

export VLLM_USE_V1=0
SWIFT_WORKSPACE="/mnt/data/share-ssd/user/shaomingchen/code/long-audio/time-aware/ft_omni/ms-swift-2-9"
export PYTHONPATH="${SWIFT_WORKSPACE}:${PYTHONPATH}"
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export LD_LIBRARY_PATH="/mnt/data/share-ks3/shaomingchen/qh_transf/pure_compiler:$LD_LIBRARY_PATH"

VLLM_USE_V1=0 \
swift rollout \
    --model "/mnt/data/share-oss/user/shaomingchen/ckpt/long-audio/time-aware/ceshi/exp6-agentic-TAG/v1-20260207-183313/checkpoint-900" \
    --vllm_tensor_parallel_size 8 \
    --vllm_max_model_len 32768 \
    --vllm_gpu_memory_utilization 0.9 \
    --multi_turn_scheduler AudioAgentScheduler \
    --vllm_limit_mm_per_prompt '{"audio": 6}' \
    --use_async_engine true \
    --vllm_enforce_eager true \
    --vllm_engine_kwargs '{"attention_config": {"backend": "FLASH_ATTN"}}' \
    --external_plugins "${SWIFT_WORKSPACE}/examples/train/multimodal/my-RL/exp9_plugin.py" \
    --port 8000 \
    --host 0.0.0.0

The GRPO training script is:

export TRITON_CACHE_DIR="/tmp/triton_cache_$(whoami)_node_${RANK}"
export TORCHINDUCTOR_CACHE_DIR="/tmp/torch_inductor_cache_$(whoami)_node_${RANK}"
LOG_DIR="/mnt/data/share-ssd/user/shaomingchen/code/long-audio/time-aware/ft_omni/ms-swift-2-9/examples/train/multimodal/logs"
mkdir -p "$LOG_DIR"
CURRENT_TIME=$(date "+%Y%m%d_%H%M%S")
LOG_FILE="${LOG_DIR}/${CURRENT_TIME}_grpo_rollout_node_${RANK}.log"

SWIFT_WORKSPACE="/mnt/data/share-ssd/user/shaomingchen/code/long-audio/time-aware/ft_omni/ms-swift-2-9"
MY_RL_PATH="${SWIFT_WORKSPACE}/examples/train/multimodal/my-RL"

export PYTHONPATH="${MY_RL_PATH}:${SWIFT_WORKSPACE}:${MEGATRON_LM_PATH}:${PYTHONPATH}"

export MS_OFFLINE=1
export HF_HUB_OFFLINE=1
export PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True'

MODEL_PATH="/mnt/data/share-oss/user/shaomingchen/ckpt/long-audio/time-aware/ceshi/exp6-agentic-TAG/v1-20260207-183313/checkpoint-900"

TRAIN_DATASET="/mnt/data/share-ssd/user/shaomingchen/code/long-audio/time-aware/ft_omni/ms-swift-2-9/examples/train/multimodal/data/agentic_data/TAG-RL/rl_selected_for_sft.jsonl"

VAL_DATASET="/mnt/data/share-ssd/user/shaomingchen/code/long-audio/time-aware/ft_omni/ms-swift-2-9/examples/train/multimodal/data/agentic_data/TAG-RL/dev.jsonl"

OUTPUT_DIR="/mnt/data/share-oss/user/shaomingchen/ckpt/long-audio/time-aware/ceshi/exp9-fsdp-grpo-rollout"

ROLLOUT_HOST="10.34.79.116"
ROLLOUT_PORT=8000
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=29501

torchrun \
    --nproc_per_node=8 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    $(which swift) rlhf \
    \
    --rlhf_type grpo \
    --model "$MODEL_PATH" \
    --template qwen3_omni \
    --agent_template hermes \
    \
    --dataset "$TRAIN_DATASET" \
    --val_dataset "$VAL_DATASET" \
    \
    --external_plugins "${MY_RL_PATH}/exp9_plugin.py" \
    \
    --reward_funcs format_reward_func iou_reward_func think_content_reward_func \
    \
    --multi_turn_scheduler AudioAgentScheduler \
    --max_turns 5 \
    \
    --num_generations 8 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --max_length 32768 \
    \
    --use_vllm true \
    --vllm_mode server \
    --vllm_server_host $ROLLOUT_HOST \
    --vllm_server_port $ROLLOUT_PORT \
    --vllm_server_pass_dataset true \
    --sleep_level 1 \
    --vllm_server_timeout 120 \
    \
    --deepspeed "${MY_RL_PATH}/exp9_deepspeed.json" \
    \
    --bf16 true \
    --output_dir "$OUTPUT_DIR" \
    \
    2>&1 | tee "$LOG_FILE"

Additional Information

No response
