Skip to content

wrong linkage between GPU ops and their CPU parents #196

@theodorbadea

Description

@theodorbadea

Introduced in: #185

Previously, linkage was done based on the Ev idx attribute of the nodes and on the timestamps. This PR introduced linkage based on External id attribute, but this fails to properly link the collective communication nodes to their CPU parent.

For example, a ncclDevKernel... operation (which is a collective communication) should be linked to a nccl:.. operation (which is a runtime op, as far as I believe) and nccl:... should be linked to a record_param_comms operation. This was the behavior before the PR.

Now, a ncclDevKernel... is wrongly linked to a record_param_comms, thus basically ending up with the nccl:... runtime operation (which is a COMP_NODE) being executed in parallel with the ncclDevKernel....

The cause of the wrong linkage is the usage of external id, as can be seen in the following example:

ncclDevKernel_AllGather_RING_LL with External id 17
nccl:_all_gather_base with External id 18
record_param_comms with External id 17

The dependencies should be as follows: record_param_comms -> nccl:_all_gather_base ->ncclDevKernel_AllGather_RING_LL, but because External id does not match, the ncclDevKernel_AllGather_RING_LL is linked to record_param_comms instead of nccl:_all_gather_base.

Furthermore, this is confirmed by analyzing the timestamps of each operation: the closest to ncclDevKernel_AllGather_RING_LL in terms of time is nccl:_all_gather_base and not record_param_comms.

@JoongunPark , any particular reason for changing Ev idx linkage to External id? Is Ev idx unreliable?

cc: @winstonliu1111 @danmih-ixia

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions