Support distributed prefill with local decode #304

Description

@lobanov

It should be possible to run generation locally on one device even if prefill is distributed. For example, this may be practical in a setup that uses a DGX Spark to offload some or even all of the prefill and a Mac for generation. The DGX Spark has a very fast GPU in theory, and the Mac has very fast memory, so offloading prefill to the former while keeping all generation on the latter could be a way to speed up large-context generation. This is also practical on homogeneous hardware, such as two Macs, because local decode is faster than decode distributed across nodes.

Conceptually it's similar to Kimi's Mooncake architecture, but at much smaller scale.

Goal: allow ds4 to pipeline prefill across two nodes while the node that owns the output layer decodes fully locally.
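
To make the goal concrete, here is a toy Python sketch of the intended flow. It is not ds4 code: `Node`, `distributed_prefill`, `hand_off_kv`, and `local_decode` are hypothetical names, the "KV cache" is just a list of token positions per layer, and the loop runs the two stages one after the other rather than overlapping them as a real pipeline would. It also assumes the decoding node holds the weights for every layer and receives the KV entries for the layers it did not prefill itself; the issue does not say how that handoff would work.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    prefill_layers: range                    # layers this node runs during prefill
    kv: dict = field(default_factory=dict)   # layer index -> cached token positions

    def run_layers(self, layers, positions):
        # Stand-in for real attention: record which positions are cached per layer.
        for layer in layers:
            self.kv.setdefault(layer, []).extend(positions)


def distributed_prefill(positions, remote, local, chunk_size):
    # Each prompt chunk passes through the remote node's layers, then the local
    # node's layers. A real pipeline would let the remote node start chunk i+1
    # while the local node is still working on chunk i.
    for start in range(0, len(positions), chunk_size):
        chunk = positions[start:start + chunk_size]
        remote.run_layers(remote.prefill_layers, chunk)
        local.run_layers(local.prefill_layers, chunk)


def hand_off_kv(remote, local):
    # Assumption: the remote node ships its KV entries to the decoding node once.
    for layer, cached in remote.kv.items():
        local.kv[layer] = list(cached)


def local_decode(local, n_layers, prompt_len, steps):
    # Every generated token runs through all layers on the local node only,
    # so no per-token network round trip is needed.
    for step in range(steps):
        local.run_layers(range(n_layers), [prompt_len + step])


remote = Node("prefill-helper", range(0, 2))   # e.g. the DGX Spark
local = Node("decoder", range(2, 4))           # e.g. the Mac, owns the output layer

distributed_prefill(list(range(10)), remote, local, chunk_size=4)
hand_off_kv(remote, local)
local_decode(local, n_layers=4, prompt_len=10, steps=3)
```

After this runs, `local.kv` holds 13 positions for each of the four layers, while `remote.kv` still holds only the 10 prompt positions for its two layers: the remote node is never touched during decode.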
