Distributed Training Error

Hi, when I train with multiple GPUs, I get a rank timeout error. If I set --nproc_per_node=1, the code runs normally without any issue. My environment: torch 2.3.1, CUDA 11.8, Python 3.10, NCCL 2.18.3.

<img width="1502" height="621" alt="Image" src="https://github.com/user-attachments/assets/1d3e7319-6912-4055-8964-3e285f166331" />
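
To help narrow this down, here is a minimal sketch of a standalone all-reduce check that is independent of the training code. It assumes the script is launched with torchrun (for example `torchrun --nproc_per_node=2 check_allreduce.py`). The file name and the helper names `pick_backend` and `allreduce_check` are hypothetical, as is the 60-second timeout. If this small script also times out, the problem is more likely in NCCL communication between the GPUs than in the training script itself.

```python
import os
from datetime import timedelta

import torch
import torch.distributed as dist


def pick_backend() -> str:
    # NCCL requires CUDA; fall back to gloo so the check also runs on CPU-only machines.
    return "nccl" if torch.cuda.is_available() else "gloo"


def allreduce_check(timeout_s: int = 60) -> float:
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT.
    # The defaults below only matter for a single-process run without torchrun.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    backend = pick_backend()
    if backend == "nccl":
        # Bind this process to its own GPU before any NCCL communication.
        torch.cuda.set_device(local_rank)
        device = torch.device("cuda", local_rank)
    else:
        device = torch.device("cpu")

    dist.init_process_group(
        backend=backend,
        rank=rank,
        world_size=world_size,
        timeout=timedelta(seconds=timeout_s),  # assumed value, shorter than the default
    )
    try:
        x = torch.ones(1, device=device)
        dist.all_reduce(x)  # default reduction is SUM across all ranks
        return x.item()  # .item() waits for the collective to finish
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    total = allreduce_check()
    print(f"rank {os.environ.get('RANK', '0')}: all_reduce sum = {total}")
```

If communication works, each rank prints a sum equal to the number of processes. Running it with the environment variable `NCCL_DEBUG=INFO` set should add NCCL's own log output, which may show where it stalls.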