Merge routed_experts across prefill/decode in P/D disaggregation #184
Draft
S1ro1 wants to merge 1 commit into
This was referenced Jun 12, 2026
When a vLLM backend is launched with --enable-return-routed-experts, every
response choice carries a routed_experts field (base64-encoded .npy of shape
(num_tokens-1, num_layers, num_experts_per_tok)). Under P/D disaggregation the
decode replica pulls the prompt KV from prefill and never forwards the prompt,
so its prompt-region routing rows are invalid. The prefill replica returns the
correct prompt-region routing, so the router splices it over the decode prefix:
merged = concat(prefill_rows[:Lp], decode_rows[Lp:])
This is a pure response-body splice, keyed on array lengths and independent of
the KV connector (NIXL / Mooncake / MoRI-IO). It runs on the existing
non-streaming merge path alongside logprobs merging; streaming responses (which
never carry routed_experts) are untouched.
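The splice above can be sketched on raw array payloads. This is a minimal illustration rather than the code in routed_experts_merge.rs: it assumes both blobs have already been base64-decoded and stripped of their .npy headers, that the data is in C (row-major) order, and that the caller knows the byte size of one row. The name splice_rows and its signature are hypothetical.

```rust
/// Row-major payload splice: merged = concat(prefill_rows[:Lp], decode_rows[Lp:]).
///
/// `prefill` and `decode` are raw array payloads (headers already removed) and
/// `row_bytes` is the size of one row, i.e.
/// num_layers * num_experts_per_tok * size_of(dtype).
/// Returns None when a buffer is not a whole number of rows or when Lp > Ld.
fn splice_rows(prefill: &[u8], decode: &[u8], row_bytes: usize) -> Option<Vec<u8>> {
    if row_bytes == 0 || prefill.len() % row_bytes != 0 || decode.len() % row_bytes != 0 {
        return None;
    }
    // Lp > Ld: the prefill array is longer than the decode array, nothing to splice.
    if prefill.len() > decode.len() {
        return None;
    }
    let mut merged = Vec::with_capacity(decode.len());
    // All Lp prefill rows replace the decode prefix ...
    merged.extend_from_slice(prefill);
    // ... and the remaining Ld - Lp rows are kept from the decode response.
    merged.extend_from_slice(&decode[prefill.len()..]);
    Some(merged)
}
```

The merged payload has the same number of rows as the decode payload, so under these assumptions the decode response's header still describes the result.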
New module routed_experts_merge.rs implements a minimal .npy header parser and
the row splice, with unit tests.
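For illustration, a minimal .npy header parser might look like the sketch below. It is not the module's actual code: NpyHeader, parse_npy_header, value_after, and rows_to_splice are hypothetical names, the input is assumed to be already base64-decoded, and fortran_order is not inspected. The final function shows one way the dtype, trailing-shape, and Lp > Ld guards listed in the PR description could be expressed on parsed headers.

```rust
/// Fields of an .npy header that a row splice needs.
#[derive(Debug, Clone, PartialEq)]
struct NpyHeader {
    descr: String,      // dtype string, e.g. "<u2"
    shape: Vec<usize>,  // e.g. [num_tokens - 1, num_layers, num_experts_per_tok]
    data_offset: usize, // byte offset where the array payload starts
}

/// Returns the text between `open` and `close` that follows `key` in the header dict.
fn value_after<'a>(dict: &'a str, key: &str, open: char, close: char) -> Option<&'a str> {
    let rest = &dict[dict.find(key)? + key.len()..];
    let rest = &rest[rest.find(open)? + open.len_utf8()..];
    Some(&rest[..rest.find(close)?])
}

/// Parses the magic string, version, header length, and the dict fields we need.
fn parse_npy_header(bytes: &[u8]) -> Option<NpyHeader> {
    if bytes.len() < 10 || !bytes.starts_with(b"\x93NUMPY") {
        return None;
    }
    // Format 1.x stores the header length as u16; 2.x and 3.x use u32 (little-endian).
    let (len_field, header_len) = match bytes[6] {
        1 => (2usize, u16::from_le_bytes([bytes[8], bytes[9]]) as usize),
        2 | 3 => {
            let b = bytes.get(8..12)?;
            (4usize, u32::from_le_bytes([b[0], b[1], b[2], b[3]]) as usize)
        }
        _ => return None,
    };
    let start = 8 + len_field;
    let end = start.checked_add(header_len)?;
    let dict = std::str::from_utf8(bytes.get(start..end)?).ok()?;
    let descr = value_after(dict, "'descr':", '\'', '\'')?.to_string();
    let shape = value_after(dict, "'shape':", '(', ')')?
        .split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(|s| s.parse::<usize>().ok())
        .collect::<Option<Vec<usize>>>()?;
    Some(NpyHeader { descr, shape, data_offset: end })
}

/// Guard sketch: Some(Lp) when dtype and trailing dimensions match and Lp <= Ld.
fn rows_to_splice(prefill: &NpyHeader, decode: &NpyHeader) -> Option<usize> {
    let (&lp, p_tail) = prefill.shape.split_first()?;
    let (&ld, d_tail) = decode.shape.split_first()?;
    (prefill.descr == decode.descr && p_tail == d_tail && lp <= ld).then_some(lp)
}
```

Returning None on any mismatch lets a caller leave the decode response unchanged instead of failing the request; whether the real router does exactly that is an assumption here.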
Signed-off-by: Matej Sirovatka <S1ro1@users.noreply.github.com>
Merge routed_experts across prefill/decode in P/D disaggregation
Background
vLLM can capture which MoE experts each token was routed to
("routed experts capture", used for routing replay / reproducible RL). When a
backend is launched with --enable-return-routed-experts, every response choice
carries a routed_experts field: a base64-encoded NumPy .npy blob of shape
(num_tokens - 1, num_layers, num_experts_per_tok) (dtype uint8/uint16). This
appears on /v1/completions, /v1/chat/completions, and the disaggregated
/inference/v1/generate responses.
Problem
Under P/D disaggregation the decode replica pulls the prompt's KV cache from
the prefill replica and therefore never forwards the prompt tokens, so the
prompt-region rows of the decode response's routed_experts are invalid. The
prefill replica forwards the whole prompt and returns the correct prompt-region
routing.
Fix
The router splices the prefill replica's prompt routing over the decode response's
invalid prefix:
merged = concat(prefill_rows[:Lp], decode_rows[Lp:Ld])
where Lp/Ld are the prefill/decode row counts. This is a pure response-body
splice keyed on array lengths, independent of the KV connector (NIXL,
Mooncake, MoRI-IO). It runs on the existing non-streaming merge path in
vllm_pd_router.rs, right alongside the logprobs merge, gated on the prefill
response actually carrying routed_experts. Streaming responses (which never carry
routed_experts) are untouched.
Implementation
- src/routers/http/routed_experts_merge.rs: a minimal .npy header parser, the
  byte-level row splice (splice_npy), and merge_routed_experts_in_json (matches
  choices by index). Guards on dtype / trailing-shape mismatch, null fields, and
  Lp > Ld. Unit-tested (cargo test --lib routed_experts).
- vllm_pd_router.rs: both non-streaming merge sites now also merge
  routed_experts when present.
- base64 dependency (already in the lock via reqwest).
Verification
Verified end-to-end on a 2-node prefill/decode deployment of Qwen3-30B-A3B with the
NIXL and Mooncake connectors, across /v1/completions and /v1/chat/completions.
Correctness was checked against a non-disaggregated oracle replica under greedy
decoding: the router-merged routed_experts matches the oracle elementwise, while
the raw (un-merged) decode response differs in the prompt region, confirming the
merge is both correct and necessary.
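The elementwise check against the oracle can be pictured with a small helper like the one below. It is only a sketch of the comparison idea, not the harness used for this PR: the function name is hypothetical, and the arrays are assumed to be already decoded into flat u16 values with a known row length (num_layers * num_experts_per_tok).

```rust
/// Returns the index of the first row where `candidate` differs from `oracle`,
/// or None when the two arrays match elementwise.
/// Both slices are flattened row-major arrays with `row_len` values per row.
fn first_mismatching_row(candidate: &[u16], oracle: &[u16], row_len: usize) -> Option<usize> {
    assert!(row_len > 0, "row_len must be non-zero");
    assert_eq!(candidate.len(), oracle.len(), "arrays must have the same size");
    candidate
        .chunks(row_len)
        .zip(oracle.chunks(row_len))
        .position(|(c, o)| c != o)
}
```

With such a helper, a merged array that matches the oracle yields None, while an un-merged decode array with an invalid prompt region would report a row index inside that region.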
Companion changes: vLLM relaxes its routed-experts/KV-connector rejection to a
warning (vllm-project/vllm), and llm-d/llm-d-router gets the equivalent merge.
Related PRs: vllm-project/vllm#45419 (relax the capture/KV-connector check) · llm-d/llm-d-router#1627 (equivalent merge in the llm-d sidecar)