-
Notifications
You must be signed in to change notification settings - Fork 62
Description
Introduced in: #185
Previously, linkage was done based on the Ev idx attribute of the nodes and on the timestamps. This PR introduced linkage based on External id attribute, but this fails to properly link the collective communication nodes to their CPU parent.
For example, a ncclDevKernel... operation (which is a collective communication) should be linked to a nccl:.. operation (which is a runtime op, as far as I believe) and nccl:... should be linked to a record_param_comms operation. This was the behavior before the PR.
Now, a ncclDevKernel... is wrongly linked to a record_param_comms, thus basically ending up with the nccl:... runtime operation (which is a COMP_NODE) being executed in parallel with the ncclDevKernel....
The cause of the wrong linkage is the usage of external id, as can be seen in the following example:
ncclDevKernel_AllGather_RING_LL with External id 17
nccl:_all_gather_base with External id 18
record_param_comms with External id 17
The dependencies should be as follows: record_param_comms -> nccl:_all_gather_base ->ncclDevKernel_AllGather_RING_LL, but because External id does not match, the ncclDevKernel_AllGather_RING_LL is linked to record_param_comms instead of nccl:_all_gather_base.
Furthermore, this is confirmed by analyzing the timestamps of each operation: the closest to ncclDevKernel_AllGather_RING_LL in terms of time is nccl:_all_gather_base and not record_param_comms.
@JoongunPark , any particular reason for changing Ev idx linkage to External id? Is Ev idx unreliable?