feat(router): register /inference/v1/generate as a typed route by aoshen02 · Pull Request #180 · vllm-project/router

aoshen02 · 2026-05-31T08:56:59Z

Fixes #179.

Problem

vLLM's disaggregated engine serves generation at /inference/v1/generate (the /inference/v1/ prefix is vendor-private). The router has no first-class typed route for it, so it falls through to the transparent proxy. On that path cache_aware never tracks worker load, so it degrades to single-worker prefix affinity — one engine serves every shared-prefix request (and can OOM its KV cache) while the others idle.

Fix

Register /inference/v1/generate as a typed route, so it reuses route_typed_request's cache_aware routing + worker load tracking + retry + circuit breaker + metrics — identical to /generate.

The typed path forwards by re-serializing the request struct, so a new lossless InferenceGenerateRequest models only what the router reads — token_ids (the cache_aware routing key), stream, and model (for get_model, consistent with the other typed endpoints) — and passes everything else through with #[serde(flatten)]. The nested sampling_params is forwarded untouched, so seed / max_tokens / logprobs reach the engine unchanged (the router never interprets sampling params; mirroring vLLM's ~100-field SamplingParams would be pure drift risk). GenerateRequest and /generate are untouched.

Test

Unit tests for InferenceGenerateRequest: routing key = token_ids joined as decimal string; lossless round-trip (sampling_params.{max_tokens,seed,logprobs,temperature} + model survive deserialize → serialize); same token_ids / different seed → identical routing key; is_stream / get_model.

Supersedes the earlier transparent-proxy load-tracking approach previously on this branch.

🤖 Generated with Claude Code

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 39428bd69b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-31T08:59:54Z

+        let load_incremented = if policy.name() == "cache_aware" {
+            worker.increment_load();


Validate methods before incrementing load

In transparent-proxy mode with the cache_aware policy, unsupported methods (for example OPTIONS/TRACE hitting the fallback handler) now increment the selected worker's load and then return 405 from the method match without hitting any of the new decrement paths. Each such request leaves a permanent in-flight count on that worker, so repeated preflights or bad-method probes can skew cache-aware routing and running_requests metrics until the process restarts; validate the method before incrementing or decrement on this early return.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3d9ac04f58

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-20T14:32:30Z


        let policy = self.policy_registry.get_default_policy();
-        let request_text = serde_json::to_string(&body).ok();
+        let request_text = Self::extract_routing_text_from_body(&body);


Preserve full bodies for hash-based policies

When the default policy is consistent_hash or rendezvous_hash and the client relies on body-based stickiness (no session header), this now passes only the extracted text/prompt into select_worker_with_headers. The hash policies call hash_key::extract_hash_key_from_body to read session_params.session_id, user, session_id, or user_id from the JSON body, so a body like {"session_params":{"session_id":"s1"},"prompt":"..."} no longer hashes on s1 and instead falls back to the prompt text, breaking documented body fallback routing for requests with varying prompts. Limit this extraction to cache_aware or keep the serialized body for policies that parse body keys.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dd62405de6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-29T09:16:41Z

+    state
+        .router
+        .route_inference_generate(Some(&headers), &body, None)


Thread model through inference-generate routing

This handler drops the request's model before routing, so InferenceGenerateRequest.model never reaches the router/policy layer. On multi-model deployments, /inference/v1/generate will therefore fall back to the default pool even when the request names a specific model; RouterManager::route_inference_generate repeats the same None forwarding, so both paths need to preserve body.model.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-29T09:16:41Z

+        let request_json = match serde_json::to_value(body) {
+            Ok(json) => json,
+            Err(e) => {
+                return (
+                    axum::http::StatusCode::INTERNAL_SERVER_ERROR,
+                    format!("Serialization error: {}", e),
+                )
+                    .into_response()
+            }
+        };
+        self.route_transparent(headers, "/inference/v1/generate", &Method::POST, request_json)


Route inference-generate by token ids, not full JSON

For cache-aware routing, this path hashes the entire serialized JSON instead of the request's token sequence. Two requests with the same token_ids but different sampling_params or other passthrough fields will therefore be treated as unrelated and can land on different workers, defeating the cache-affinity contract established by InferenceGenerateRequest::extract_text_for_routing().

Useful? React with 👍 / 👎.

aoshen02 · 2026-06-30T00:07:54Z

Validation

Built this branch and ran it against 4 live vLLM engines. The test sends 256 requests to /inference/v1/generate, structured as 32 distinct prompts × 8 requests that share the same prompt prefix (a best-of-n / multi-sample pattern), and sweeps all three policies.

End to End routing (4 workers, 256 requests):

policy	per-worker load	engine prefix-cache hit
`round_robin`	64 / 64 / 64 / 64	~60%
`cache_aware`	64 / 64 / 64 / 64	~86%
`consistent_hash`	61 / 81 / 57 / 57	~60%

Where the 86% comes from

Each of the 32 prompts is sent as a group of 8 requests sharing the full prompt prefix. cache_aware routes all 8 of a group to the same worker, so that worker computes the prefix once and reuses it for the other 7 → ceiling = (8−1)/8 = 87.5%, and we measure ~86% (essentially the ceiling). round_robin scatters each group across workers, so the prefix gets recomputed on several engines → only ~60%.

The per-request decision trace confirms the mechanism directly:

match_rate=0.00 matched=0/873   branch=MIN_LOAD  chosen=:15006 tenant=:15006   # new prompt, tree empty → least-loaded worker, seeds the tree
match_rate=1.00 matched=873/873 branch=CACHE_HIT chosen=:15006 tenant=:15006   # later request of same group → full match → sticks to :15006

223 of 256 requests matched at rate 1.0 — and they group into 31 distinct prefix lengths, each appearing ~7×, i.e. exactly the "7 reuses per 8-request group".

Why this shows cache_aware actually doing its job (not degenerate, and not just consistent_hash)

The fix's value is that cache_aware now gets that reuse without hot-spotting. On the old transparent-proxy path it had no load counter, so affinity meant one worker served every shared-prefix request while the others idled (and risked KV-cache OOM). On the typed route it achieves ~86% reuse AND an even 64/64/64/64 load at the same time.

What makes that possible — and distinguishes it from a pure hash policy — is the load-vs-cache arbitration in the trace: when a request's best-prefix-match worker (tenant) is only weakly matched (below cache_threshold = 0.3), it is redirected to the least-loaded worker instead of piling onto the prefix owner. In this run, of the requests that took the min-load branch, 17 of 22 had tenant != chosen_worker — a load override that consistent_hash cannot do (it routes purely by key hash, which is why it skews to 81 vs 57 with no rebalancing).

vLLM's disaggregated engine serves generation at /inference/v1/generate (the /inference/v1/ prefix is vendor-private). The router has no first-class typed route for it, so it falls through to the transparent proxy. On the transparent path cache_aware never tracks worker load, so it degrades to single-worker prefix affinity — one engine serves every shared-prefix request (and can OOM its KV cache) while the others idle. Register /inference/v1/generate as a typed route so it reuses route_typed_request's cache_aware routing + worker load tracking + retry + circuit breaker + metrics, identical to /generate. The typed path forwards by re-serializing the request struct, so a new, lossless InferenceGenerateRequest models only what the router reads (token_ids for the routing key, stream, model) and passes everything else through with #[serde(flatten)] — notably the nested sampling_params, so seed/max_tokens/logprobs reach the engine unchanged. GenerateRequest and /generate are untouched. Adds unit tests pinning the routing key and the lossless round-trip. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: aoshen02 <aoshen@inferact.ai>

chatgpt-codex-connector Bot reviewed May 31, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 20, 2026

View reviewed changes

aoshen02 closed this Jun 29, 2026

aoshen02 reopened this Jun 29, 2026

aoshen02 force-pushed the fix/transparent-proxy-load-tracking branch from 3d9ac04 to dd62405 Compare June 29, 2026 09:07

chatgpt-codex-connector Bot reviewed Jun 29, 2026

View reviewed changes

aoshen02 changed the title ~~fix(router): track worker load in route_transparent for cache_aware~~ feat(router): register /inference/v1/generate as a typed route Jun 29, 2026

aoshen02 force-pushed the fix/transparent-proxy-load-tracking branch from dd62405 to f81fdeb Compare June 30, 2026 00:17

BugenZhao approved these changes Jun 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(router): register /inference/v1/generate as a typed route#180

feat(router): register /inference/v1/generate as a typed route#180
aoshen02 wants to merge 1 commit into
vllm-project:mainfrom
aoshen02:fix/transparent-proxy-load-tracking

aoshen02 commented May 31, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 31, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 20, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 29, 2026

Uh oh!

chatgpt-codex-connector Bot Jun 29, 2026

Uh oh!

aoshen02 commented Jun 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		let load_incremented = if policy.name() == "cache_aware" {
		worker.increment_load();

Uh oh!

Conversation

aoshen02 commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Test

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 31, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

aoshen02 commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Validation

Where the 86% comes from

Why this shows cache_aware actually doing its job (not degenerate, and not just consistent_hash)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

aoshen02 commented May 31, 2026 •

edited

Loading

aoshen02 commented Jun 30, 2026 •

edited

Loading