Merge routed_experts across prefill/decode in P/D disaggregation by S1ro1 · Pull Request #184 · vllm-project/router

S1ro1 · 2026-06-12T15:20:20Z

Merge `routed_experts` across prefill/decode in P/D disaggregation

Background

vLLM can record, for each token, which MoE experts it was routed to
("routed experts capture", used for routing replay / reproducible RL). When a
backend is launched with --enable-return-routed-experts, every response choice
carries a routed_experts field: a base64-encoded NumPy .npy blob of shape
(num_tokens - 1, num_layers, num_experts_per_tok) and dtype uint8 or uint16.
The field appears on /v1/completions, /v1/chat/completions, and the
disaggregated /inference/v1/generate responses.
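
To make the shape concrete, here is a minimal sketch of how a decoded payload of that shape could be addressed by token row. The `RoutedShape` type and its methods are hypothetical names for illustration, not part of vLLM or the router, and the sketch assumes the array is stored in C (row-major) order.

```rust
/// Shape of a decoded `routed_experts` array, following the PR description:
/// (num_tokens - 1, num_layers, num_experts_per_tok).
/// Hypothetical helper type, not taken from the router source.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct RoutedShape {
    pub rows: usize,            // num_tokens - 1
    pub layers: usize,          // num_layers
    pub experts_per_tok: usize, // num_experts_per_tok
}

impl RoutedShape {
    /// Bytes in one token row; `itemsize` is 1 for uint8, 2 for uint16.
    pub fn row_bytes(&self, itemsize: usize) -> usize {
        self.layers * self.experts_per_tok * itemsize
    }

    /// Total payload size in bytes, excluding the .npy header.
    pub fn payload_bytes(&self, itemsize: usize) -> usize {
        self.rows * self.row_bytes(itemsize)
    }

    /// Byte offset of element (row, layer, slot), assuming C order.
    pub fn offset(&self, row: usize, layer: usize, slot: usize, itemsize: usize) -> usize {
        ((row * self.layers + layer) * self.experts_per_tok + slot) * itemsize
    }
}
```

Because the token axis is the leading one, each token's routing occupies one contiguous run of `row_bytes` bytes, which is what makes a byte-level row splice possible.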

Problem

Under P/D disaggregation the decode replica pulls the prompt's KV cache from
the prefill replica and therefore never runs a forward pass over the prompt
tokens, so the prompt-region rows of the decode response's routed_experts are
invalid. The prefill replica does run the whole prompt and returns the correct
prompt-region routing.

Fix

The router splices the prefill replica's prompt routing over the decode response's
invalid prefix:

merged = concat( prefill_rows[0:Lp], decode_rows[Lp:Ld] )      # keeps decode shape/dtype

where Lp/Ld are the prefill/decode row counts. This is a pure response-body
splice keyed on array lengths — independent of the KV connector (NIXL,
Mooncake, MoRI-IO). It runs on the existing non-streaming merge path in
vllm_pd_router.rs, right alongside the logprobs merge, gated on the prefill
response actually carrying routed_experts. Streaming responses (which never carry
routed_experts) are untouched.
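
The splice above can be sketched on raw payload bytes as follows. This is an illustration under stated assumptions, not the PR's `splice_npy`: the function name `splice_rows` is hypothetical, both inputs are assumed to be C-order payloads with the .npy header already stripped, and returning `None` when Lp > Ld is one possible way to apply the guard (the PR does not say what its guard does in that case).

```rust
/// Splice prefill rows over the invalid prefix of the decode rows.
///
/// `prefill` and `decode` are raw C-order payloads (header stripped) and
/// `row_bytes` is the size of one token row in bytes. Returns `None` if
/// either payload is not a whole number of rows or if Lp > Ld.
pub fn splice_rows(prefill: &[u8], decode: &[u8], row_bytes: usize) -> Option<Vec<u8>> {
    if row_bytes == 0 || prefill.len() % row_bytes != 0 || decode.len() % row_bytes != 0 {
        return None;
    }
    let lp = prefill.len() / row_bytes; // prefill row count
    let ld = decode.len() / row_bytes; // decode row count
    if lp > ld {
        return None;
    }
    // merged = concat(prefill_rows[0:Lp], decode_rows[Lp:Ld])
    let mut merged = Vec::with_capacity(decode.len());
    merged.extend_from_slice(prefill);
    merged.extend_from_slice(&decode[prefill.len()..]);
    Some(merged)
}
```

The result always has the decode payload's length, so the decode response's header (shape and dtype) can be reused unchanged.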

Implementation

- New src/routers/http/routed_experts_merge.rs: a minimal .npy header parser,
  the byte-level row splice (splice_npy), and merge_routed_experts_in_json
  (matches choices by index). Guards against dtype / trailing-shape mismatch,
  null fields, and Lp > Ld. Unit-tested (cargo test --lib routed_experts).
- vllm_pd_router.rs: both non-streaming merge sites now also merge
  routed_experts when present.
- Adds a direct base64 dependency (already present in the lockfile via reqwest).

Verification

Verified end-to-end on a 2-node prefill/decode deployment of Qwen3-30B-A3B with the
NIXL and Mooncake connectors, across /v1/completions and /v1/chat/completions.
Correctness was checked against a non-disaggregated oracle replica under greedy
decoding: the router-merged routed_experts matches the oracle elementwise, while
the raw (un-merged) decode response differs in the prompt region — confirming the
merge is both correct and necessary.
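
An elementwise check of this kind can be expressed as a row-by-row comparison of two payloads. The helper below is a hypothetical illustration, not the script used for the verification above; it assumes both payloads share the same shape and dtype and have had their headers stripped.

```rust
/// Return the indices of token rows where two equally sized payloads differ.
/// `None` means the inputs are not comparable (size or row-width mismatch).
pub fn mismatched_rows(a: &[u8], b: &[u8], row_bytes: usize) -> Option<Vec<usize>> {
    if row_bytes == 0 || a.len() != b.len() || a.len() % row_bytes != 0 {
        return None;
    }
    Some(
        a.chunks(row_bytes)
            .zip(b.chunks(row_bytes))
            .enumerate()
            .filter(|(_, (x, y))| x != y)
            .map(|(i, _)| i)
            .collect(),
    )
}
```

With this helper, a merged response that matches the oracle yields an empty list, while a raw decode response would be expected to report mismatches only at row indices below Lp.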

Companion changes: vLLM relaxes its routed-experts/KV-connector rejection to a
warning (vllm-project/vllm), and llm-d/llm-d-router gets the equivalent merge.

Related PRs: vllm-project/vllm#45419 (relax the capture/KV-connector check) · llm-d/llm-d-router#1627 (equivalent merge in the llm-d sidecar)

When a vLLM backend is launched with --enable-return-routed-experts, every response choice carries a routed_experts field (base64-encoded .npy of shape (num_tokens-1, num_layers, num_experts_per_tok)).

Under P/D disaggregation the decode replica pulls the prompt KV from prefill and never forwards the prompt, so its prompt-region routing rows are invalid. The prefill replica returns the correct prompt-region routing, so the router splices it over the decode prefix:

merged = concat(prefill_rows[:Lp], decode_rows[Lp:])

This is a pure response-body splice, keyed on array lengths and independent of the KV connector (NIXL / Mooncake / MoRI-IO). It runs on the existing non-streaming merge path alongside logprobs merging; streaming responses (which never carry routed_experts) are untouched.

New module routed_experts_merge.rs implements a minimal .npy header parser and the row splice, with unit tests.

Signed-off-by: Matej Sirovatka <S1ro1@users.noreply.github.com>

This was referenced Jun 12, 2026

Merge routed_experts across prefill/decode in the P/D sidecar llm-d/llm-d-router#1627

Open

Relax routed_experts capture KV-connector check to a warning for P/D vllm-project/vllm#45419

Draft

S1ro1 force-pushed the feat/routed-experts-pd-merge branch 2 times, most recently from 42a3550 to 6c4c78e on June 12, 2026 15:59

S1ro1 force-pushed the feat/routed-experts-pd-merge branch from 6c4c78e to c3cf112 on June 12, 2026 16:05

Merge routed_experts across prefill/decode in P/D disaggregation #184
S1ro1 wants to merge 1 commit into vllm-project:main from S1ro1:feat/routed-experts-pd-merge
