Skip to content

feat(router): register /inference/v1/generate as a typed route#180

Open
aoshen02 wants to merge 1 commit into
vllm-project:mainfrom
aoshen02:fix/transparent-proxy-load-tracking
Open

feat(router): register /inference/v1/generate as a typed route#180
aoshen02 wants to merge 1 commit into
vllm-project:mainfrom
aoshen02:fix/transparent-proxy-load-tracking

Conversation

@aoshen02

@aoshen02 aoshen02 commented May 31, 2026

Copy link
Copy Markdown

Fixes #179.

Problem

vLLM's disaggregated engine serves generation at /inference/v1/generate (the /inference/v1/ prefix is vendor-private). The router has no first-class typed route for it, so it falls through to the transparent proxy. On that path cache_aware never tracks worker load, so it degrades to single-worker prefix affinity — one engine serves every shared-prefix request (and can OOM its KV cache) while the others idle.

Fix

Register /inference/v1/generate as a typed route, so it reuses route_typed_request's cache_aware routing + worker load tracking + retry + circuit breaker + metrics — identical to /generate.

The typed path forwards by re-serializing the request struct, so a new lossless InferenceGenerateRequest models only what the router reads — token_ids (the cache_aware routing key), stream, and model (for get_model, consistent with the other typed endpoints) — and passes everything else through with #[serde(flatten)]. The nested sampling_params is forwarded untouched, so seed / max_tokens / logprobs reach the engine unchanged (the router never interprets sampling params; mirroring vLLM's ~100-field SamplingParams would be pure drift risk). GenerateRequest and /generate are untouched.

Test

Unit tests for InferenceGenerateRequest: routing key = token_ids joined as decimal string; lossless round-trip (sampling_params.{max_tokens,seed,logprobs,temperature} + model survive deserialize → serialize); same token_ids / different seed → identical routing key; is_stream / get_model.

Supersedes the earlier transparent-proxy load-tracking approach previously on this branch.

🤖 Generated with Claude Code

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 39428bd69b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/routers/http/router.rs Outdated
Comment on lines +1696 to +1697
let load_incremented = if policy.name() == "cache_aware" {
worker.increment_load();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate methods before incrementing load

In transparent-proxy mode with the cache_aware policy, unsupported methods (for example OPTIONS/TRACE hitting the fallback handler) now increment the selected worker's load and then return 405 from the method match without hitting any of the new decrement paths. Each such request leaves a permanent in-flight count on that worker, so repeated preflights or bad-method probes can skew cache-aware routing and running_requests metrics until the process restarts; validate the method before incrementing or decrement on this early return.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3d9ac04f58

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/routers/http/router.rs Outdated

let policy = self.policy_registry.get_default_policy();
let request_text = serde_json::to_string(&body).ok();
let request_text = Self::extract_routing_text_from_body(&body);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve full bodies for hash-based policies

When the default policy is consistent_hash or rendezvous_hash and the client relies on body-based stickiness (no session header), this now passes only the extracted text/prompt into select_worker_with_headers. The hash policies call hash_key::extract_hash_key_from_body to read session_params.session_id, user, session_id, or user_id from the JSON body, so a body like {"session_params":{"session_id":"s1"},"prompt":"..."} no longer hashes on s1 and instead falls back to the prompt text, breaking documented body fallback routing for requests with varying prompts. Limit this extraction to cache_aware or keep the serialized body for policies that parse body keys.

Useful? React with 👍 / 👎.

@aoshen02 aoshen02 closed this Jun 29, 2026
@aoshen02 aoshen02 reopened this Jun 29, 2026
@aoshen02 aoshen02 force-pushed the fix/transparent-proxy-load-tracking branch from 3d9ac04 to dd62405 Compare June 29, 2026 09:07

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dd62405de6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/server.rs
Comment on lines +239 to +241
state
.router
.route_inference_generate(Some(&headers), &body, None)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Thread model through inference-generate routing

This handler drops the request's model before routing, so InferenceGenerateRequest.model never reaches the router/policy layer. On multi-model deployments, /inference/v1/generate will therefore fall back to the default pool even when the request names a specific model; RouterManager::route_inference_generate repeats the same None forwarding, so both paths need to preserve body.model.

Useful? React with 👍 / 👎.

Comment thread src/routers/http/vllm_pd_router.rs Outdated
Comment on lines +1756 to +1766
let request_json = match serde_json::to_value(body) {
Ok(json) => json,
Err(e) => {
return (
axum::http::StatusCode::INTERNAL_SERVER_ERROR,
format!("Serialization error: {}", e),
)
.into_response()
}
};
self.route_transparent(headers, "/inference/v1/generate", &Method::POST, request_json)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Route inference-generate by token ids, not full JSON

For cache-aware routing, this path hashes the entire serialized JSON instead of the request's token sequence. Two requests with the same token_ids but different sampling_params or other passthrough fields will therefore be treated as unrelated and can land on different workers, defeating the cache-affinity contract established by InferenceGenerateRequest::extract_text_for_routing().

Useful? React with 👍 / 👎.

@aoshen02 aoshen02 changed the title fix(router): track worker load in route_transparent for cache_aware feat(router): register /inference/v1/generate as a typed route Jun 29, 2026
@aoshen02

aoshen02 commented Jun 30, 2026

Copy link
Copy Markdown
Author

Validation

Built this branch and ran it against 4 live vLLM engines. The test sends 256 requests to /inference/v1/generate, structured as 32 distinct prompts × 8 requests that share the same prompt prefix (a best-of-n / multi-sample pattern), and sweeps all three policies.

End to End routing (4 workers, 256 requests):

policy per-worker load engine prefix-cache hit
round_robin 64 / 64 / 64 / 64 ~60%
cache_aware 64 / 64 / 64 / 64 ~86%
consistent_hash 61 / 81 / 57 / 57 ~60%

Where the 86% comes from

Each of the 32 prompts is sent as a group of 8 requests sharing the full prompt prefix. cache_aware routes all 8 of a group to the same worker, so that worker computes the prefix once and reuses it for the other 7 → ceiling = (8−1)/8 = 87.5%, and we measure ~86% (essentially the ceiling). round_robin scatters each group across workers, so the prefix gets recomputed on several engines → only ~60%.

The per-request decision trace confirms the mechanism directly:

match_rate=0.00 matched=0/873   branch=MIN_LOAD  chosen=:15006 tenant=:15006   # new prompt, tree empty → least-loaded worker, seeds the tree
match_rate=1.00 matched=873/873 branch=CACHE_HIT chosen=:15006 tenant=:15006   # later request of same group → full match → sticks to :15006

223 of 256 requests matched at rate 1.0 — and they group into 31 distinct prefix lengths, each appearing ~7×, i.e. exactly the "7 reuses per 8-request group".

Why this shows cache_aware actually doing its job (not degenerate, and not just consistent_hash)

The fix's value is that cache_aware now gets that reuse without hot-spotting. On the old transparent-proxy path it had no load counter, so affinity meant one worker served every shared-prefix request while the others idled (and risked KV-cache OOM). On the typed route it achieves ~86% reuse AND an even 64/64/64/64 load at the same time.

What makes that possible — and distinguishes it from a pure hash policy — is the load-vs-cache arbitration in the trace: when a request's best-prefix-match worker (tenant) is only weakly matched (below cache_threshold = 0.3), it is redirected to the least-loaded worker instead of piling onto the prefix owner. In this run, of the requests that took the min-load branch, 17 of 22 had tenant != chosen_worker — a load override that consistent_hash cannot do (it routes purely by key hash, which is why it skews to 81 vs 57 with no rebalancing).

vLLM's disaggregated engine serves generation at /inference/v1/generate
(the /inference/v1/ prefix is vendor-private). The router has no first-class
typed route for it, so it falls through to the transparent proxy. On the
transparent path cache_aware never tracks worker load, so it degrades to
single-worker prefix affinity — one engine serves every shared-prefix request
(and can OOM its KV cache) while the others idle.

Register /inference/v1/generate as a typed route so it reuses
route_typed_request's cache_aware routing + worker load tracking + retry +
circuit breaker + metrics, identical to /generate.

The typed path forwards by re-serializing the request struct, so a new,
lossless InferenceGenerateRequest models only what the router reads
(token_ids for the routing key, stream, model) and passes everything else
through with #[serde(flatten)] — notably the nested sampling_params, so
seed/max_tokens/logprobs reach the engine unchanged. GenerateRequest and
/generate are untouched.

Adds unit tests pinning the routing key and the lossless round-trip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: aoshen02 <aoshen@inferact.ai>
@aoshen02 aoshen02 force-pushed the fix/transparent-proxy-load-tracking branch from dd62405 to f81fdeb Compare June 30, 2026 00:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cache_aware degenerates to a single worker on transparent-proxy endpoints (route_transparent never tracks worker load)

2 participants