Merge routed_experts across prefill/decode in P/D disaggregation #184

Draft
S1ro1 wants to merge 1 commit into vllm-project:main from S1ro1:feat/routed-experts-pd-merge

Conversation

@S1ro1 S1ro1 commented Jun 12, 2026

Merge routed_experts across prefill/decode in P/D disaggregation

Background

vLLM can record, for each token, which MoE experts it was routed to
("routed experts capture", used for routing replay / reproducible RL). When a
backend is launched with --enable-return-routed-experts, every response choice
carries a routed_experts field: a base64-encoded NumPy .npy blob of shape
(num_tokens - 1, num_layers, num_experts_per_tok) with dtype uint8 or uint16.
The field appears on /v1/completions, /v1/chat/completions, and the
disaggregated /inference/v1/generate responses.

Problem

Under P/D disaggregation the decode replica pulls the prompt's KV cache from
the prefill replica and therefore never runs a forward pass over the prompt
tokens, so the prompt-region rows of the decode response's routed_experts are
invalid. The prefill replica does run the whole prompt and returns the correct
prompt-region routing.

Fix

The router splices the prefill replica's prompt routing over the decode response's
invalid prefix:

merged = concat( prefill_rows[0:Lp], decode_rows[Lp:Ld] )      # keeps decode shape/dtype

where Lp/Ld are the prefill/decode row counts. This is a pure response-body
splice keyed on array lengths — independent of the KV connector (NIXL,
Mooncake, MoRI-IO). It runs on the existing non-streaming merge path in
vllm_pd_router.rs, right alongside the logprobs merge, gated on the prefill
response actually carrying routed_experts. Streaming responses (which never carry
routed_experts) are untouched.
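
The splice can be illustrated on raw payload bytes. This is a minimal sketch, assuming both payloads are C-order with the .npy header already stripped and that `row_bytes` equals num_layers * num_experts_per_tok * itemsize. `splice_rows_sketch` is a hypothetical name for this example and is not the `splice_npy` function in this PR.

```rust
/// Overwrite the first Lp rows of `decode` with the rows of `prefill`.
/// Returns None if either payload is not a whole number of rows or if
/// the prefill side has more rows than the decode side (Lp > Ld).
fn splice_rows_sketch(prefill: &[u8], decode: &[u8], row_bytes: usize) -> Option<Vec<u8>> {
    if row_bytes == 0 || prefill.len() % row_bytes != 0 || decode.len() % row_bytes != 0 {
        return None;
    }
    let lp = prefill.len() / row_bytes; // prefill row count
    let ld = decode.len() / row_bytes; // decode row count
    if lp > ld {
        return None;
    }
    let cut = lp * row_bytes;
    let mut merged = Vec::with_capacity(decode.len());
    merged.extend_from_slice(&prefill[..cut]); // prefill_rows[0:Lp]
    merged.extend_from_slice(&decode[cut..]); // decode_rows[Lp:Ld]
    Some(merged)
}
```

Because the merged payload has exactly as many bytes as the decode payload, the decode response's header (shape and dtype) can be reused unchanged, which is what "keeps decode shape/dtype" amounts to in this sketch.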

Implementation

  • New src/routers/http/routed_experts_merge.rs: a minimal .npy header parser,
    the byte-level row splice (splice_npy), and merge_routed_experts_in_json
    (matches choices by index). Guards on dtype / trailing-shape mismatch, null
    fields, and Lp > Ld. Unit-tested (cargo test --lib routed_experts).
  • vllm_pd_router.rs: both non-streaming merge sites now also merge
    routed_experts when present.
  • Adds a direct base64 dependency (already in the lock via reqwest).
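
The guards listed above can be sketched as a single decision function. This is an illustration only: `MergeDecision` and `decide_merge_sketch` are hypothetical names, each input is the `(descr, shape)` pair read from a .npy header (or `None` for an absent or null field), and the sketch assumes a failed guard simply means the merge is skipped.

```rust
#[derive(Debug, PartialEq)]
enum MergeDecision {
    /// Safe to splice: lp prefill rows over the first lp of ld decode rows.
    Splice { lp: usize, ld: usize },
    /// Skip the merge, with a short reason.
    Skip(&'static str),
}

fn decide_merge_sketch(
    prefill: Option<(&str, &[usize])>,
    decode: Option<(&str, &[usize])>,
) -> MergeDecision {
    // Null or missing field on either side: nothing to merge.
    let ((p_descr, p_shape), (d_descr, d_shape)) = match (prefill, decode) {
        (Some(p), Some(d)) => (p, d),
        _ => return MergeDecision::Skip("missing or null routed_experts"),
    };
    // Rows from different dtypes cannot be mixed byte-for-byte.
    if p_descr != d_descr {
        return MergeDecision::Skip("dtype mismatch");
    }
    // Everything after the row dimension must agree.
    if p_shape.is_empty() || d_shape.is_empty() || p_shape[1..] != d_shape[1..] {
        return MergeDecision::Skip("trailing-shape mismatch");
    }
    let (lp, ld) = (p_shape[0], d_shape[0]);
    if lp > ld {
        return MergeDecision::Skip("prefill has more rows than decode");
    }
    MergeDecision::Splice { lp, ld }
}
```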

Verification

Verified end-to-end on a 2-node prefill/decode deployment of Qwen3-30B-A3B with the
NIXL and Mooncake connectors, across /v1/completions and /v1/chat/completions.
Correctness was checked against a non-disaggregated oracle replica under greedy
decoding: the router-merged routed_experts matches the oracle elementwise, while
the raw (un-merged) decode response differs in the prompt region — confirming the
merge is both correct and necessary.

Companion changes: vllm-project/vllm#45419 relaxes vLLM's routed-experts/KV-connector
rejection to a warning, and llm-d/llm-d-router#1627 adds the equivalent merge to the
llm-d sidecar.

When a vLLM backend is launched with --enable-return-routed-experts, every
response choice carries a routed_experts field (base64-encoded .npy of shape
(num_tokens-1, num_layers, num_experts_per_tok)). Under P/D disaggregation the
decode replica pulls the prompt KV from prefill and never forwards the prompt,
so its prompt-region routing rows are invalid. The prefill replica returns the
correct prompt-region routing, so the router splices it over the decode prefix:

    merged = concat(prefill_rows[:Lp], decode_rows[Lp:])

This is a pure response-body splice, keyed on array lengths and independent of
the KV connector (NIXL / Mooncake / MoRI-IO). It runs on the existing
non-streaming merge path alongside logprobs merging; streaming responses (which
never carry routed_experts) are untouched.

New module routed_experts_merge.rs implements a minimal .npy header parser and
the row splice, with unit tests.

Signed-off-by: Matej Sirovatka <S1ro1@users.noreply.github.com>
@S1ro1 S1ro1 force-pushed the feat/routed-experts-pd-merge branch from 6c4c78e to c3cf112 Compare June 12, 2026 16:05