Open
Labels
bug (Something isn't working)
Description
Checklist
- I have searched existing issues, and this is a new bug report.
Bug Description
After the rollout server is up, launching GRPO training fails: the trainer cannot reach port 51216 on the vLLM server's IP and the connection times out.
How to Reproduce / 如何复现
The rollout script:

```shell
# Node A
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export VLLM_SERVER_GROUP_HOST=0.0.0.0
export VLLM_SERVER_GROUP_PORT=51216
export VLLM_USE_V1=0
SWIFT_WORKSPACE="/mnt/data/share-ssd/user/shaomingchen/code/long-audio/time-aware/ft_omni/ms-swift-2-9"
export PYTHONPATH="${SWIFT_WORKSPACE}:${PYTHONPATH}"
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export LD_LIBRARY_PATH="/mnt/data/share-ks3/shaomingchen/qh_transf/pure_compiler:$LD_LIBRARY_PATH"
VLLM_USE_V1=0 \
swift rollout \
  --model "/mnt/data/share-oss/user/shaomingchen/ckpt/long-audio/time-aware/ceshi/exp6-agentic-TAG/v1-20260207-183313/checkpoint-900" \
  --vllm_tensor_parallel_size 8 \
  --vllm_max_model_len 32768 \
  --vllm_gpu_memory_utilization 0.9 \
  --multi_turn_scheduler AudioAgentScheduler \
  --vllm_limit_mm_per_prompt '{"audio": 6}' \
  --use_async_engine true \
  --vllm_enforce_eager true \
  --vllm_engine_kwargs '{"attention_config": {"backend": "FLASH_ATTN"}}' \
  --external_plugins "${SWIFT_WORKSPACE}/examples/train/multimodal/my-RL/exp9_plugin.py" \
  --port 8000 \
  --host 0.0.0.0
```
The GRPO training script:

```shell
export TRITON_CACHE_DIR="/tmp/triton_cache_$(whoami)_node_${RANK}"
export TORCHINDUCTOR_CACHE_DIR="/tmp/torch_inductor_cache_$(whoami)_node_${RANK}"
LOG_DIR="/mnt/data/share-ssd/user/shaomingchen/code/long-audio/time-aware/ft_omni/ms-swift-2-9/examples/train/multimodal/logs"
mkdir -p "$LOG_DIR"
CURRENT_TIME=$(date "+%Y%m%d_%H%M%S")
LOG_FILE="${LOG_DIR}/${CURRENT_TIME}_grpo_rollout_node_${RANK}.log"
SWIFT_WORKSPACE="/mnt/data/share-ssd/user/shaomingchen/code/long-audio/time-aware/ft_omni/ms-swift-2-9"
MY_RL_PATH="${SWIFT_WORKSPACE}/examples/train/multimodal/my-RL"
export PYTHONPATH="${MY_RL_PATH}:${SWIFT_WORKSPACE}:${MEGATRON_LM_PATH}:${PYTHONPATH}"
export MS_OFFLINE=1
export HF_HUB_OFFLINE=1
export PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True'
MODEL_PATH="/mnt/data/share-oss/user/shaomingchen/ckpt/long-audio/time-aware/ceshi/exp6-agentic-TAG/v1-20260207-183313/checkpoint-900"
TRAIN_DATASET="/mnt/data/share-ssd/user/shaomingchen/code/long-audio/time-aware/ft_omni/ms-swift-2-9/examples/train/multimodal/data/agentic_data/TAG-RL/rl_selected_for_sft.jsonl"
VAL_DATASET="/mnt/data/share-ssd/user/shaomingchen/code/long-audio/time-aware/ft_omni/ms-swift-2-9/examples/train/multimodal/data/agentic_data/TAG-RL/dev.jsonl"
OUTPUT_DIR="/mnt/data/share-oss/user/shaomingchen/ckpt/long-audio/time-aware/ceshi/exp9-fsdp-grpo-rollout"
ROLLOUT_HOST="10.34.79.116"
ROLLOUT_PORT=8000
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=29501
torchrun \
  --nproc_per_node=8 \
  --nnodes=1 \
  --node_rank=0 \
  --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  $(which swift) rlhf \
  --rlhf_type grpo \
  --model "$MODEL_PATH" \
  --template qwen3_omni \
  --agent_template hermes \
  --dataset "$TRAIN_DATASET" \
  --val_dataset "$VAL_DATASET" \
  --external_plugins "${MY_RL_PATH}/exp9_plugin.py" \
  --reward_funcs format_reward_func iou_reward_func think_content_reward_func \
  --multi_turn_scheduler AudioAgentScheduler \
  --max_turns 5 \
  --num_generations 8 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 4 \
  --max_length 32768 \
  --use_vllm true \
  --vllm_mode server \
  --vllm_server_host $ROLLOUT_HOST \
  --vllm_server_port $ROLLOUT_PORT \
  --vllm_server_pass_dataset true \
  --sleep_level 1 \
  --vllm_server_timeout 120 \
  --deepspeed "${MY_RL_PATH}/exp9_deepspeed.json" \
  --bf16 true \
  --output_dir "$OUTPUT_DIR" \
  2>&1 | tee "$LOG_FILE"
```
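Since the reported timeout is on port 51216 (the `VLLM_SERVER_GROUP_PORT` used for the weight-sync group) rather than the HTTP port 8000 the trainer is configured with, it may help to confirm from the training node that both ports on the rollout host are actually reachable (e.g. not blocked by a firewall, or bound but only on a different interface). A minimal probe sketch; the host and port values are taken from the scripts above and are only illustrative:

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Attempt a plain TCP connect; True means something is listening there."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Values from the scripts above; run this on the GRPO training node.
    for port in (8000, 51216):
        print(port, port_open("10.34.79.116", port, timeout=1.0))
```

If 8000 responds but 51216 does not, the weight-sync port is likely not exposed between the nodes even though the HTTP endpoint is.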
Additional Information
No response