Description
Checklist
- I have searched existing issues, and this is a new question or discussion topic.
Question Description
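Qwen3-32B on two H20 machines. I wanted to try pipeline parallelism (PP) and set pipeline_model_parallel_size=2, but the run fails with an NCCL connectivity error. How can I specify the port 52597 that appears in the error? The master_port in my env is 20000.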
[rank0]: File "/usr/local/lib/python3.11/site-packages/swift/megatron/train/sft.py", line 77, in run
[rank0]: self.trainer.train(train_dataset, val_dataset, data_collator)
[rank0]: File "/usr/local/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 1098, in train
[rank0]: pretrain(
[rank0]: File "/root/.cache/modelscope/_github/Megatron-LM/megatron/training/training.py", line 586, in pretrain
[rank0]: initialize_megatron(
[rank0]: File "/usr/local/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 108, in initialize_megatron
[rank0]: res = origin_initialize_megatron(*_args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/.cache/modelscope/_github/Megatron-LM/megatron/training/initialize.py", line 162, in initialize_megatron
[rank0]: _compile_dependencies()
[rank0]: File "/root/.cache/modelscope/_github/Megatron-LM/megatron/training/initialize.py", line 221, in _compile_dependencies
[rank0]: torch.distributed.barrier()
[rank0]: File "/usr/local/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4881, in barrier
[rank0]: work = group.barrier(opts=opts)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3690, remote process exited or there was a network error, NCCL version 2.27.5
[rank0]: ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
[rank0]: Last error:
[rank0]: socketPollConnect: connect to 169.254.20.10<52597> returned Connection refused, exceeded error retry count after 35 attempts
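
One observation on the last line: 169.254.20.10 falls in the IPv4 link-local range (169.254.0.0/16), so NCCL may have picked an interface that is not routable between the two machines. Port 52597 is probably one that NCCL chose itself rather than anything derived from master_port, but that is an assumption. Below is a minimal sketch for pulling the peer address out of such a line and checking it. The helper name `parse_nccl_peer` is hypothetical and not part of swift, Megatron-LM, or NCCL.

```python
import ipaddress
import re


def parse_nccl_peer(line):
    """Extract (ip, port) from an NCCL 'connect to IP<PORT>' message, or None."""
    match = re.search(r"connect to (\d{1,3}(?:\.\d{1,3}){3})<(\d+)>", line)
    if match is None:
        return None
    return ipaddress.ip_address(match.group(1)), int(match.group(2))


# The failing line from the traceback above, copied verbatim.
line = (
    "socketPollConnect: connect to 169.254.20.10<52597> returned "
    "Connection refused, exceeded error retry count after 35 attempts"
)

peer = parse_nccl_peer(line)
if peer is not None:
    ip, port = peer
    # is_link_local is True for any address in 169.254.0.0/16.
    kind = "link-local" if ip.is_link_local else "not link-local"
    print(f"peer={ip} port={port} ({kind})")
```

If the address is link-local, it may be worth trying NCCL's `NCCL_SOCKET_IFNAME` environment variable on both machines to select the interface that carries the inter-node network. Whether that resolves this particular failure is untested here.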