feat(router): register /inference/v1/generate as a typed route#180
feat(router): register /inference/v1/generate as a typed route#180aoshen02 wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 39428bd69b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let load_incremented = if policy.name() == "cache_aware" { | ||
| worker.increment_load(); |
There was a problem hiding this comment.
Validate methods before incrementing load
In transparent-proxy mode with the cache_aware policy, unsupported methods (for example OPTIONS/TRACE hitting the fallback handler) now increment the selected worker's load and then return 405 from the method match without hitting any of the new decrement paths. Each such request leaves a permanent in-flight count on that worker, so repeated preflights or bad-method probes can skew cache-aware routing and running_requests metrics until the process restarts; validate the method before incrementing or decrement on this early return.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3d9ac04f58
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| let policy = self.policy_registry.get_default_policy(); | ||
| let request_text = serde_json::to_string(&body).ok(); | ||
| let request_text = Self::extract_routing_text_from_body(&body); |
There was a problem hiding this comment.
Preserve full bodies for hash-based policies
When the default policy is consistent_hash or rendezvous_hash and the client relies on body-based stickiness (no session header), this now passes only the extracted text/prompt into select_worker_with_headers. The hash policies call hash_key::extract_hash_key_from_body to read session_params.session_id, user, session_id, or user_id from the JSON body, so a body like {"session_params":{"session_id":"s1"},"prompt":"..."} no longer hashes on s1 and instead falls back to the prompt text, breaking documented body fallback routing for requests with varying prompts. Limit this extraction to cache_aware or keep the serialized body for policies that parse body keys.
Useful? React with 👍 / 👎.
3d9ac04 to
dd62405
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dd62405de6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| state | ||
| .router | ||
| .route_inference_generate(Some(&headers), &body, None) |
There was a problem hiding this comment.
Thread
model through inference-generate routing
This handler drops the request's model before routing, so InferenceGenerateRequest.model never reaches the router/policy layer. On multi-model deployments, /inference/v1/generate will therefore fall back to the default pool even when the request names a specific model; RouterManager::route_inference_generate repeats the same None forwarding, so both paths need to preserve body.model.
Useful? React with 👍 / 👎.
| let request_json = match serde_json::to_value(body) { | ||
| Ok(json) => json, | ||
| Err(e) => { | ||
| return ( | ||
| axum::http::StatusCode::INTERNAL_SERVER_ERROR, | ||
| format!("Serialization error: {}", e), | ||
| ) | ||
| .into_response() | ||
| } | ||
| }; | ||
| self.route_transparent(headers, "/inference/v1/generate", &Method::POST, request_json) |
There was a problem hiding this comment.
Route inference-generate by token ids, not full JSON
For cache-aware routing, this path hashes the entire serialized JSON instead of the request's token sequence. Two requests with the same token_ids but different sampling_params or other passthrough fields will therefore be treated as unrelated and can land on different workers, defeating the cache-affinity contract established by InferenceGenerateRequest::extract_text_for_routing().
Useful? React with 👍 / 👎.
ValidationBuilt this branch and ran it against 4 live vLLM engines. The test sends 256 requests to End to End routing (4 workers, 256 requests):
Where the 86% comes fromEach of the 32 prompts is sent as a group of 8 requests sharing the full prompt prefix. The per-request decision trace confirms the mechanism directly: 223 of 256 requests matched at rate 1.0 — and they group into 31 distinct prefix lengths, each appearing ~7×, i.e. exactly the "7 reuses per 8-request group". Why this shows cache_aware actually doing its job (not degenerate, and not just consistent_hash)The fix's value is that cache_aware now gets that reuse without hot-spotting. On the old transparent-proxy path it had no load counter, so affinity meant one worker served every shared-prefix request while the others idled (and risked KV-cache OOM). On the typed route it achieves ~86% reuse AND an even 64/64/64/64 load at the same time. What makes that possible — and distinguishes it from a pure hash policy — is the load-vs-cache arbitration in the trace: when a request's best-prefix-match worker ( |
vLLM's disaggregated engine serves generation at /inference/v1/generate (the /inference/v1/ prefix is vendor-private). The router has no first-class typed route for it, so it falls through to the transparent proxy. On the transparent path cache_aware never tracks worker load, so it degrades to single-worker prefix affinity — one engine serves every shared-prefix request (and can OOM its KV cache) while the others idle. Register /inference/v1/generate as a typed route so it reuses route_typed_request's cache_aware routing + worker load tracking + retry + circuit breaker + metrics, identical to /generate. The typed path forwards by re-serializing the request struct, so a new, lossless InferenceGenerateRequest models only what the router reads (token_ids for the routing key, stream, model) and passes everything else through with #[serde(flatten)] — notably the nested sampling_params, so seed/max_tokens/logprobs reach the engine unchanged. GenerateRequest and /generate are untouched. Adds unit tests pinning the routing key and the lossless round-trip. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: aoshen02 <aoshen@inferact.ai>
dd62405 to
f81fdeb
Compare
Fixes #179.
Problem
vLLM's disaggregated engine serves generation at
/inference/v1/generate(the/inference/v1/prefix is vendor-private). The router has no first-class typed route for it, so it falls through to the transparent proxy. On that pathcache_awarenever tracks worker load, so it degrades to single-worker prefix affinity — one engine serves every shared-prefix request (and can OOM its KV cache) while the others idle.Fix
Register
/inference/v1/generateas a typed route, so it reusesroute_typed_request'scache_awarerouting + worker load tracking + retry + circuit breaker + metrics — identical to/generate.The typed path forwards by re-serializing the request struct, so a new lossless
InferenceGenerateRequestmodels only what the router reads —token_ids(the cache_aware routing key),stream, andmodel(forget_model, consistent with the other typed endpoints) — and passes everything else through with#[serde(flatten)]. The nestedsampling_paramsis forwarded untouched, soseed/max_tokens/logprobsreach the engine unchanged (the router never interprets sampling params; mirroring vLLM's ~100-fieldSamplingParamswould be pure drift risk).GenerateRequestand/generateare untouched.Test
Unit tests for
InferenceGenerateRequest: routing key =token_idsjoined as decimal string; lossless round-trip (sampling_params.{max_tokens,seed,logprobs,temperature}+modelsurvive deserialize → serialize); sametoken_ids/ differentseed→ identical routing key;is_stream/get_model.🤖 Generated with Claude Code