_posts/2026-05-06-mooncake-store.md

## Agentic workloads are reshaping LLM serving

With the rise of LLM agents such as Claude Code and OpenClaw, inference workloads are undergoing a fundamental shift. As Jensen highlighted in his GTC 2026 [keynote](https://www.nvidia.com/gtc/keynote/), LLMs are moving beyond simple chatbots toward autonomous, long-running systems that plan, reason, and act toward complex goals.

What makes agentic workloads unique is their structure. They typically consist of long-horizon, multi-turn loops that alternate between a *reasoning step*, where the model processes context and produces intermediate thoughts, and an *action step*, where the model issues tool calls and receives external outputs.

To quantify this behavior, we collected and analyzed traces from Codex and GPT-5.4 on the SWE-bench Pro dataset. We have also open-sourced the dataset [here](https://huggingface.co/datasets/Inferact/codex_swebenchpro_traces) to encourage broader community study of agentic serving workloads.

Figure 1 summarizes the Codex/SWE-bench Pro traces and shows a representative agentic session.

<p align="center">
<img src="/assets/figures/2026-05-06-mooncake-store/agentic_trace.svg" width="90%">
<br>
<em>Figure 1: Anatomy of an agentic trace from the Codex/SWE-bench Pro corpus. Each row is one LLM call; per-turn sizes use medians across 610 traces. The cached prefix (system prompt, skills/memory, prior turns' history) is reused turn after turn, while only the new tool output and the model's decode are active each turn.</em>
</p>

The pattern is striking: by turn 30, context length grows to roughly **80K tokens**, and the longest contexts can grow beyond **180K tokens**. Yet each turn typically introduces only a few hundred to a few thousand new tokens. The rest is prefix that the model has already seen. Across the dataset, the average input-to-output token ratio is roughly **131:1**.

If we can cache those prefixes, prefill for the cached portion becomes essentially free. The true per-turn cost is only the new delta.
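
To make the arithmetic concrete, here is a minimal sketch (ours, not part of the released traces artifact) of how prefix caching changes per-turn prefill cost; the session shape below is an illustrative assumption in the spirit of the traces, not measured data.

```python
# Illustrative sketch: prefill cost for one agentic session with and without
# prefix caching. Turn lengths are made-up values in the spirit of the traces
# (context grows by a couple of thousand tokens per turn).

def prefill_tokens(context_lengths, cached_prefix=True):
    """Total tokens that must be prefilled across all turns of one session."""
    total, prev = 0, 0
    for ctx in context_lengths:
        total += (ctx - prev) if cached_prefix else ctx
        prev = ctx
    return total

# A toy 30-turn session whose context grows from 12K to roughly 80K tokens.
turns = [12_000 + 2_300 * i for i in range(30)]

with_cache = prefill_tokens(turns, cached_prefix=True)
without_cache = prefill_tokens(turns, cached_prefix=False)
print(f"prefill with prefix cache:    {with_cache:,} tokens")
print(f"prefill without prefix cache: {without_cache:,} tokens")
print(f"reduction:                    {without_cache / with_cache:.1f}x")
```

Even in this toy session, re-prefilling the full context every turn costs more than an order of magnitude more tokens than prefilling only the per-turn delta.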

Across the Codex/SWE-bench Pro dataset, comprising 610 traces with a median of 33 turns per trace, we observe:

- 94.2% cache hit rate
- 131:1 input-to-output ratio
Figure 2 depicts the overall design.
<em>Figure 2: Overall design of the vLLM distributed KV cache pool. Multiple vLLM instances embed Mooncake clients and share a cluster-wide Mooncake Store. The Mooncake master manages KV-block metadata, service discovery, and client health, while workers transfer KV blocks between GPU HBM and the distributed DRAM or SSD pool over RDMA.</em>
</p>

At a high level, Mooncake Store consists of a master server and a set of clients. The master server runs cluster-wide and manages metadata such as KV block hashes and sizes. It also monitors client health and availability, providing service discovery and dead-node cleanup.

Mooncake clients run on GPU nodes where they manage local CPU/DRAM/SSD resources. Clients connect to one another through RDMA for KV cache transfer. Together, they form a distributed KV cache pool.

The vLLM integration plugs into the existing [`KVConnector`](https://github.com/vllm-project/vllm/blob/db9a84e0cd0e17ab693467ff4a71103abd4b77bf/vllm/distributed/kv_transfer/kv_connector/v1/base.py) interface, the same abstraction used for PD disaggregation. The connector has two roles:

On the **scheduler side**, when a new request arrives, vLLM hashes the prompt's token blocks, queries the Mooncake master for matching KV cache blocks, and uses the result to guide scheduling decisions.
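
As a rough illustration of this lookup (a simplified sketch, not vLLM's or Mooncake's actual code), token blocks can be hashed in a chain so that each block hash also commits to everything before it, and the store is probed block by block until the first miss:

```python
# Simplified sketch of chained block hashing and prefix matching.
# BLOCK_SIZE and the hashing scheme are illustrative assumptions.

import hashlib

BLOCK_SIZE = 16  # tokens per KV block

def block_hashes(token_ids: list[int]) -> list[str]:
    """Chained hashes: each block's hash depends on all preceding blocks."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        digest = hashlib.sha256(parent + str(block).encode()).hexdigest()
        hashes.append(digest)
        parent = digest.encode()
    return hashes

def matched_prefix_tokens(token_ids: list[int], store_keys: set[str]) -> int:
    """Count prompt tokens whose KV blocks already exist in the store."""
    matched = 0
    for h in block_hashes(token_ids):
        if h not in store_keys:
            break
        matched += BLOCK_SIZE
    return matched
```

The scheduler can then skip prefill for the matched tokens and compute only the remaining delta; in the real integration, the lookup goes to the Mooncake master rather than a local set.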

On the **worker side**, vLLM embeds a Mooncake client in each GPU worker and launches background threads for data movement. GPU KV cache memory is registered as RDMA buffers, enabling GPUDirect RDMA reads and writes through the Mooncake client without using SMs or staging through CPU memory.

Expand All @@ -83,7 +83,7 @@ We take a third approach: using the RDMA NIC and GPUDirect RDMA to move KV block

Thanks to the Mooncake Transfer Engine, the transfer path can also leverage multiple RNICs on a node through multi-NIC pooling and topology-aware path selection. This lets KV transfers aggregate bandwidth across NICs and better utilize the available network capacity.

### Fully asynchronous transfer

Although RDMA operations are asynchronous, preparing descriptors and issuing RDMA reads and writes still requires non-trivial CPU work. This overhead grows with sequence length because longer sequences contain more KV blocks.
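
One way to keep this CPU work off the critical path is to issue transfers from background threads and only poll for completion later. The sketch below is a minimal illustration of that idea, not the actual worker-side implementation:

```python
# Illustrative sketch: issue KV-block saves asynchronously so descriptor
# preparation and RDMA posting overlap with model execution.

from concurrent.futures import Future, ThreadPoolExecutor

class AsyncKVTransfer:
    def __init__(self, num_threads: int = 4):
        self._pool = ThreadPoolExecutor(max_workers=num_threads)
        self._inflight: dict[str, Future] = {}

    def issue_save(self, request_id: str, block_ids: list[int]) -> None:
        """Returns immediately; the transfer runs on a background thread."""
        self._inflight[request_id] = self._pool.submit(self._save, block_ids)

    def finished(self, request_id: str) -> bool:
        """Non-blocking completion check the scheduler can poll each step."""
        fut = self._inflight.get(request_id)
        if fut is not None and fut.done():
            fut.result()  # surface any transfer error
            del self._inflight[request_id]
            return True
        return False

    def _save(self, block_ids: list[int]) -> None:
        # Placeholder: build RDMA descriptors and post writes for each block.
        pass
```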

Expand All @@ -99,15 +99,15 @@ The integration also naturally extends to PD disaggregation through the [`MultiC
<em>Figure 3: PD disaggregation combined with the distributed KV cache pool via MultiConnector.</em>
</p>

**Prefill:** The prefill instance prepares KV blocks for the PD connector and also stores them in the distributed KV cache pool through the store connector. For cache hits, vLLM queries all connectors and can recover matching prefixes from the Mooncake Store connector.

**Decode:** When the decode instance writes KV blocks into the distributed pool, they immediately become visible to prefill instances. Decode itself does not currently read from the pool: because vLLM schedules each request to both a prefill and a decode instance, the prefill instance loads any prefix KV blocks from the pool and forwards them to decode through the PD connector.
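
To illustrate how querying multiple connectors composes, here is a hedged sketch (not the actual `MultiConnector` code) in which the scheduler asks each source how long a prefix it can recover and routes the load to the best one:

```python
# Illustrative sketch: pick the connector that recovers the longest prefix.

from typing import Callable

class PrefixSource:
    def __init__(self, name: str, lookup: Callable[[list[int]], int]):
        self.name = name
        self._lookup = lookup  # prompt token ids -> matched prefix length

    def matched_tokens(self, token_ids: list[int]) -> int:
        return self._lookup(token_ids)

def choose_source(token_ids: list[int], sources: list[PrefixSource]):
    """Return (source, matched_tokens) for the longest recoverable prefix."""
    best, best_len = None, 0
    for src in sources:
        n = src.matched_tokens(token_ids)
        if n > best_len:
            best, best_len = src, n
    return best, best_len

# Hypothetical usage: a PD connector and a store connector side by side.
pd = PrefixSource("pd", lambda ids: 0)                   # nothing yet for a fresh session
store = PrefixSource("mooncake_store", lambda ids: 512)  # cached prefix from earlier turns
src, n = choose_source(list(range(1024)), [pd, store])
print(src.name, n)  # -> mooncake_store 512
```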

We are working to enable multi-path KV cache loading from both the prefill instance and the distributed pool, which would maximize available network bandwidth.

## Performance

The current implementation is available [here](https://github.com/vllm-project/vllm/pull/40900). We also provide benchmark scripts in the artifact repository [here](https://github.com/ivanium/vllm/tree/feat/mooncake-store-int/scripts/mooncake/artifacts). In this post, we highlight two results.

We ran the Kimi-2.5 NVFP4 model on GB200 nodes with PD disaggregation. The prefill instance used TP4, while the decode instance used DP8 + EP. We found that this configuration provided the best latency-throughput tradeoff.

Expand All @@ -121,11 +121,11 @@ We first evaluated vLLM in a realistic setting using the Codex agentic traces de
<em>Figure 4: vLLM with Mooncake Store vs. baseline on realistic Codex agentic traces (1P1D, 12 GB200 GPUs). The distributed KV cache pool improves throughput by 3.8x, reduces P50 TTFT by 46x, and reduces E2E latency by 8.6x, driven by a cache hit rate increase from 1.7% to 92.2%.</em>
</p>

The distributed KV cache pool improves vLLM throughput by **3.8x** and reduces P50 TTFT and E2E latency by **46x** and **8.6x**, respectively. These gains are driven by a dramatic increase in cache hit rate: from **1.7%**, where only the system prompt is cached, to **92.2%**, where nearly the entire prefix is cached.

### Scaling out to multiple nodes

For the scalability test, we further increased the number of nodes and used a synthetic dataset derived from the Codex workload for controlled scaling experiments.

Experiment settings:

- 900 output tokens
- 30 turns total
- Number of sessions scaled with number of GPUs: 75 → 150 → 225 → 300 → 375
- Parameters were chosen to roughly align with the original Codex workload and keep the total output/input ratio ~1.3%
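
The generator below sketches how such a synthetic workload can be produced. The 30 turns, 900 output tokens, and session counts come from the settings above; the initial context and per-turn input growth are assumptions chosen so the total output/input ratio lands near 1.3%:

```python
# Sketch of a synthetic multi-turn session generator (parameters marked
# "assumed" are ours; the rest follow the experiment settings above).

INITIAL_CONTEXT = 12_000       # first-turn prompt tokens (assumed)
NEW_INPUT_PER_TURN = 3_100     # new tool-output tokens added each turn (assumed)
OUTPUT_TOKENS = 900            # per-turn decode length (from the settings)
NUM_TURNS = 30                 # turns per session (from the settings)

def make_session(session_id: int) -> list[dict]:
    turns, context = [], INITIAL_CONTEXT
    for turn in range(NUM_TURNS):
        turns.append({
            "session": session_id,
            "turn": turn,
            "input_tokens": context,        # full prompt, mostly cached prefix
            "output_tokens": OUTPUT_TOKENS,
        })
        # Next prompt = previous prompt + this decode + new tool output.
        context += OUTPUT_TOKENS + NEW_INPUT_PER_TURN
    return turns

sessions = [make_session(i) for i in range(75)]  # smallest configuration above
total_in = sum(t["input_tokens"] for s in sessions for t in s)
total_out = sum(t["output_tokens"] for s in sessions for t in s)
print(f"output/input ratio: {total_out / total_in:.2%}")  # ~1.3%
```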

<p align="center">
<img src="/assets/figures/2026-05-06-mooncake-store/pd_scaling.png" width="90%">