It should be possible to run generation locally on one device even when prefill is distributed. For example, this may be practical in a setup that uses a DGX Spark to offload some or even all of the prefill and a Mac for generation. The DGX Spark has a very fast GPU in theory, and the Mac has very fast memory, so offloading prefill to the former while keeping all generation on the latter could speed up large-context generation. This is also practical on homogeneous hardware, such as two Macs, because local decode is faster than decode pipelined across nodes, which pays a network hop for every generated token.
Conceptually, this is similar to Kimi's Mooncake architecture, which separates prefill from decode, but at a much smaller scale.
Goal: allow ds4 to pipeline prefill across two nodes while the node that owns the output layer decodes fully locally.
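
To make the goal concrete, here is a toy sketch of the idea, not ds4's actual implementation. Every name in it (`Node`, `run_layer`, `prefill_pipelined`, `decode_local`, `SPLIT`) is hypothetical, and the "layer" is a stand-in that only mimics a per-layer KV cache. It assumes the decode node can run every layer itself and that the prefill node ships the KV entries for its layers once prefill finishes. Under those assumptions, distributed prefill followed by local decode should give the same result as running everything on one node.

```python
from __future__ import annotations

from dataclasses import dataclass, field

N_LAYERS = 4
SPLIT = 2  # hypothetical split: layers [0, SPLIT) are prefilled on the remote node


def run_layer(cache: list[float], h: float) -> float:
    """Toy layer: append the input to this layer's KV cache, then mix it in."""
    cache.append(h)
    return h + sum(cache) / len(cache)


@dataclass
class Node:
    layers: range
    kv: dict[int, list[float]] = field(default_factory=dict)

    def forward(self, h: float) -> float:
        # Run the layers this node currently owns, updating their caches.
        for i in self.layers:
            h = run_layer(self.kv.setdefault(i, []), h)
        return h


def prefill_pipelined(prompt: list[float], remote: Node, local: Node) -> float:
    """Remote runs the first layers, local runs the rest (sequential here;
    a real pipeline would overlap the two nodes on different chunks)."""
    last = 0.0
    for tok in prompt:
        last = local.forward(remote.forward(tok))
    # Assumption: remote ships its KV entries so decode needs no network hop.
    local.kv.update({i: list(c) for i, c in remote.kv.items()})
    local.layers = range(N_LAYERS)  # local now runs every layer itself
    return last


def decode_local(node: Node, h: float, steps: int) -> list[float]:
    """Generate `steps` toy outputs on a single node."""
    out = []
    for _ in range(steps):
        h = node.forward(h)
        out.append(h)
    return out


def matches_single_node(prompt: list[float], steps: int) -> bool:
    # Reference: everything on one node.
    ref = Node(range(N_LAYERS))
    h_ref = 0.0
    for tok in prompt:
        h_ref = ref.forward(tok)
    ref_out = decode_local(ref, h_ref, steps)

    # Distributed prefill, then fully local decode.
    remote = Node(range(0, SPLIT))
    local = Node(range(SPLIT, N_LAYERS))
    h = prefill_pipelined(prompt, remote, local)
    return h == h_ref and decode_local(local, h, steps) == ref_out
```

Calling `matches_single_node([1.0, 2.0, 3.0], steps=3)` checks the equivalence on a small prompt. In this toy model, the only state that has to cross the network after prefill is the remote node's KV cache; decode itself never leaves the local node.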