This request comes from #5890. Currently, nvFuser uses too much memory because it allgathers one of the einsum's operands in full before the compute.
A better (though not necessarily optimal) approach is to stream-parallelize the allgather and the reducescatter:
This way, each GPU only has to store O(b * s/dy * s/dx * c).
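The memory win can be illustrated with a single-process NumPy sketch. The sizes, the plain-GEMM framing, and the shard layout below are illustrative assumptions, not the exact einsum from #5890: the point is only that streaming consumes one shard per step instead of materializing the full allgathered operand.

```python
import numpy as np

D = 4                       # deviceDim.x: number of devices
m, k, n = 8, 5, 6           # toy GEMM sizes; m is sharded across devices

rng = np.random.default_rng(0)
A_shards = [rng.standard_normal((m // D, k)) for _ in range(D)]  # one shard per device
B = rng.standard_normal((k, n))

# Un-streamed: the allgather materializes all of A on every device at once,
# so peak residency is the full O(m * k) operand.
A_full = np.concatenate(A_shards)
reference = A_full @ B

# Streamed: D steps; step t only needs the single shard "received" in that
# step, so peak residency drops to one (m/D)-row chunk plus the output tile.
# On real hardware each step's communication overlaps the previous step's
# compute; the serial loop here only models the data dependence.
out = np.empty((m, n))
for t in range(D):
    out[t * (m // D):(t + 1) * (m // D)] = A_shards[t] @ B

assert np.allclose(out, reference)
```

The reducescatter side is symmetric: each step's partial output tile can be sent off as soon as it is produced, so it also never has to be resident all at once.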
In nvFuser, this can be represented as:
Note that … is a `Swizzle1D` similar to `(streamIdx + deviceIdx.x) % deviceDim.x`.
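A quick way to see why this swizzle is the right stream-to-peer mapping: for any fixed stream step, the expression sends the `deviceDim.x` devices to `deviceDim.x` distinct peers, so no two devices contend for the same peer in the same step. A small self-contained check (variable names mirror the expression above; `D` stands in for `deviceDim.x`):

```python
D = 4  # deviceDim.x

# Peer targeted by each stream on each device, per the swizzle
# (streamIdx + deviceIdx.x) % deviceDim.x.
for device in range(D):
    peers = [(stream + device) % D for stream in range(D)]
    print(f"device {device}: peers per stream = {peers}")

# In every stream step, the D devices target a full permutation of peers,
# i.e. the schedule is contention-free.
for stream in range(D):
    assert sorted((stream + device) % D for device in range(D)) == list(range(D))
```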