Megatron SFT training fails with "socketPollConnect: connect to 169.254.20.10<52597> returned Connection refused" #8043

@leuitong

Checklist

  • I have searched existing issues, and this is a new question or discussion topic.

Question Description

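I am training Qwen3-32B on two H20 machines and wanted to try pipeline parallelism (PP), so I set pipeline_model_parallel_size=2. During startup, NCCL reports a connectivity error. How can I specify the port 52597 that appears in the error? The master_port in my env is 20000.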

[rank0]: File "/usr/local/lib/python3.11/site-packages/swift/megatron/train/sft.py", line 77, in run
[rank0]: self.trainer.train(train_dataset, val_dataset, data_collator)
[rank0]: File "/usr/local/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 1098, in train
[rank0]: pretrain(
[rank0]: File "/root/.cache/modelscope/_github/Megatron-LM/megatron/training/training.py", line 586, in pretrain
[rank0]: initialize_megatron(
[rank0]: File "/usr/local/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 108, in initialize_megatron
[rank0]: res = origin_initialize_megatron(*_args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/.cache/modelscope/_github/Megatron-LM/megatron/training/initialize.py", line 162, in initialize_megatron
[rank0]: _compile_dependencies()
[rank0]: File "/root/.cache/modelscope/_github/Megatron-LM/megatron/training/initialize.py", line 221, in _compile_dependencies
[rank0]: torch.distributed.barrier()
[rank0]: File "/usr/local/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4881, in barrier
[rank0]: work = group.barrier(opts=opts)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3690, remote process exited or there was a network error, NCCL version 2.27.5
[rank0]: ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
[rank0]: Last error:
[rank0]: socketPollConnect: connect to 169.254.20.10<52597> returned Connection refused, exceeded error retry count after 35 attempts
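
A small diagnostic sketch, not a confirmed fix: it parses the last error line above and checks whether the peer address falls in the link-local range 169.254.0.0/16. The helper name `parse_nccl_peer` is hypothetical and is not part of swift, Megatron-LM, or NCCL. As far as I know, NCCL opens its own sockets on dynamically assigned ports, so a port such as 52597 is not expected to match master_port. If the address is link-local, one possibility (an assumption, not verified for this setup) is that NCCL selected an unintended network interface; NCCL documents the `NCCL_SOCKET_IFNAME` environment variable for restricting which interface it uses.

```python
import ipaddress
import re

# Last error line copied from the traceback above.
ERROR_LINE = (
    "socketPollConnect: connect to 169.254.20.10<52597> "
    "returned Connection refused, exceeded error retry count after 35 attempts"
)


def parse_nccl_peer(line):
    """Hypothetical helper: extract (ip, port) from a 'connect to ip<port>' message."""
    match = re.search(r"connect to ([0-9a-fA-F.:]+)<(\d+)>", line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


ip, port = parse_nccl_peer(ERROR_LINE)

# 169.254.0.0/16 is the IPv4 link-local range; such addresses are normally
# not routable between two separate machines.
is_link_local = ipaddress.ip_address(ip).is_link_local

print(f"peer={ip} port={port} link_local={is_link_local}")
# prints: peer=169.254.20.10 port=52597 link_local=True
```

If the check reports a link-local address, comparing it against the interfaces listed by `ip addr` on both nodes may show which interface NCCL picked up.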
