Support distributed prefill with local decode #304

It should be possible to run generation locally on one device even when prefill is distributed. For example, this may be practical in a setup that uses a DGX Spark to offload some or even all of the prefill and a Mac for generation. The DGX Spark has a very fast GPU in theory, and the Mac has very fast memory, so offloading prefill to the former while keeping all generation on the latter could be a way to speed up large-context generation. This is also practical on homogeneous hardware, such as two Macs, because decoding on a single node is faster than decoding across the pipeline.

Conceptually, it's similar to [Kimi's Mooncake architecture](https://arxiv.org/abs/2407.00079), but at a much smaller scale.

**Goal**: allow ds4 to pipeline prefill across two nodes while the node that owns the `output` layer decodes fully locally.
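
To make the intended data flow concrete, here is a toy sketch in Python. It is not ds4 code: the layer function, `SPLIT`, `distributed_prefill`, and `local_decode` are all hypothetical names, and the "layers" are simple integer arithmetic standing in for real transformer layers. It assumes a non-empty prompt, assumes the decode node holds weights for every layer, and assumes the remote node's KV cache is shipped over once prefill finishes. The two stages run one after the other here, whereas a real pipeline would overlap them across chunks.

```python
N_LAYERS = 4
SPLIT = 2    # hypothetical: layers [0, SPLIT) are prefilled on the remote node
VOCAB = 50


def run_layer(layer, hidden, cache):
    """Toy layer: store a 'KV' entry, then mix in everything cached so far."""
    cache.append(hidden)
    return (hidden + sum(cache) + layer) % 1009


def forward(layers, hiddens, kv):
    """Run a span of layers over a chunk of positions, updating the KV cache."""
    for layer in layers:
        cache = kv.setdefault(layer, [])
        hiddens = [run_layer(layer, h, cache) for h in hiddens]
    return hiddens


def distributed_prefill(prompt, chunk):
    """Prefill in chunks: remote node runs the early layers, local node the rest."""
    remote_kv, local_kv = {}, {}
    for i in range(0, len(prompt), chunk):
        h = forward(range(0, SPLIT), prompt[i:i + chunk], remote_kv)  # remote node
        h = forward(range(SPLIT, N_LAYERS), h, local_kv)              # local node
    local_kv.update(remote_kv)  # one-time KV transfer to the decode node
    return h[-1], local_kv


def local_decode(last_hidden, kv, n_tokens):
    """Decode on one node only: every layer runs locally against the full cache."""
    out, tok = [], last_hidden % VOCAB
    for _ in range(n_tokens):
        out.append(tok)
        tok = forward(range(N_LAYERS), [tok], kv)[0] % VOCAB
    return out


def generate_distributed(prompt, n_tokens, chunk=3):
    last, kv = distributed_prefill(prompt, chunk)
    return local_decode(last, kv, n_tokens)


def generate_local(prompt, n_tokens):
    """Reference path: prefill and decode entirely on one node."""
    kv = {}
    last = forward(range(N_LAYERS), prompt, kv)[-1]
    return local_decode(last, kv, n_tokens)
```

In this sketch, `generate_distributed` and `generate_local` produce the same tokens for the same prompt, which is the property a real implementation would need to preserve: splitting prefill changes where the KV cache is computed, not what the decode node ends up seeing.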
