feat: automatic retry and failover for rate-limited LLM requests #733
raheelshahzad wants to merge 2 commits.
Conversation
Thanks a lot for putting this change together @raheelshahzad. Please join our Discord channel too. Overall this looks good!
I left some comments in the PR and have some additional suggestions/comments on the overall change:
- we should use exponential backoff on retries
- how do we ensure that we have not exceeded the request timeout? (see the budget-check sketch after the config example below)
- max_retries should be defined somewhere in config.yaml; probably not in this PR, but we should let developers define that variable
- this code change needs an update to the docs
- I think we should allow retries to the same provider, or at least let developers define whether they want to retry on a different provider. Consider the following example:
```yaml
model_providers:
  - model: openai/gpt-4o
    base_url: https://dsna-oai.openai.azure.com
    access_key: $OPENAI_API_KEY
    retry_on_ratelimit: true      # new feature
    retry_to_same_provider: true  # only allow retry to the same provider; otherwise retry randomly across all models
  - model: openai/gpt-5
    base_url: https://dsna-oai.openai.azure.com
    access_key: $OPENAI_API_KEY
```
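On the timeout point above, a minimal sketch of what a retry budget check could look like, assuming the gateway tracks a per-request deadline; the function and names are hypothetical, not part of this PR:

```rust
use std::time::{Duration, Instant};

// Hypothetical budget check: only retry if the next backoff sleep still
// fits inside the overall request timeout.
fn may_retry(started: Instant, request_timeout: Duration, next_backoff: Duration) -> bool {
    started.elapsed() + next_backoff < request_timeout
}
```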
```rust
self.providers.iter().find_map(|(key, provider)| {
    if provider.internal != Some(true)
        && provider.name != current_name
        && key == &provider.name
    {
        Some(Arc::clone(provider))
    } else {
        None
    }
})
```
should pick a random model
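A minimal sketch of that suggestion, assuming a `Provider` shape matching the diff above and the `rand` crate; names are illustrative, not the PR's actual API:

```rust
use std::collections::HashMap;
use std::sync::Arc;
use rand::seq::IteratorRandom;

struct Provider {
    name: String,
    internal: Option<bool>,
}

// Collect all eligible providers, then pick one uniformly at random
// instead of returning the first match.
fn pick_random_alternative(
    providers: &HashMap<String, Arc<Provider>>,
    current_name: &str,
) -> Option<Arc<Provider>> {
    providers
        .iter()
        .filter(|&(key, provider)| {
            provider.internal != Some(true)
                && provider.name != current_name
                && key == &provider.name
        })
        .map(|(_, provider)| Arc::clone(provider))
        .choose(&mut rand::thread_rng())
}
```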
```rust
if res.status() == StatusCode::TOO_MANY_REQUESTS && attempts < max_attempts {
    let providers = llm_providers.read().await;
    if let Some(provider) = providers.get(&current_resolved_model) {
        if provider.retry_on_ratelimit == Some(true) {
            if let Some(alt_provider) = providers.get_alternative(&current_resolved_model) {
                info!(
                    request_id = %request_id,
                    current_model = %current_resolved_model,
                    alt_model = %alt_provider.name,
                    "429 received, retrying with alternative model"
                );
                current_resolved_model = alt_provider.name.clone();
                continue;
            }
        }
    }
}
```
we need to add exponential backoff
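A minimal sketch of exponential backoff with full jitter (not the PR's implementation), using the 25ms base / 250ms cap defaults proposed later in this thread:

```rust
use rand::Rng;
use std::time::Duration;

// Exponential backoff with full jitter: the delay grows as base * 2^attempt,
// capped at `max`, and the actual sleep is uniformly random in [0, cap].
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let capped = base
        .saturating_mul(2u32.saturating_pow(attempt))
        .min(max);
    let jitter_ms = rand::thread_rng().gen_range(0..=capped.as_millis() as u64);
    Duration::from_millis(jitter_ms)
}
```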
```rust
let mut current_resolved_model = resolved_model.clone();
let mut current_client_request = client_request;
let mut attempts = 0;
let max_attempts = 2; // Original + 1 retry
```
this should be configurable
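For reference, a sketch of how the hard-coded value could become a config field with a serde default; the field names are hypothetical, not the PR's schema:

```rust
use serde::Deserialize;

// Sketch: per-provider retry settings, defaulting to the current
// hard-coded behavior (original request + 1 retry).
#[derive(Debug, Deserialize)]
pub struct RetrySettings {
    #[serde(default = "default_max_attempts")]
    pub max_attempts: u32,
}

fn default_max_attempts() -> u32 {
    2 // original request + 1 retry
}
```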
```rust
);
// Capture start time right before sending request to upstream
let request_start_time = std::time::Instant::now();
let _request_start_system_time = std::time::SystemTime::now();
```
I looked through Envoy's retry semantics (https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/route/v3/route_components.proto#envoy-v3-api-field-config-route-v3-routeaction-retry-policy) and I think we should lean toward this design for retries. We don't have to implement it completely, but we should implement a bare minimum following similar semantics/config. Thoughts?
Force-pushed d1aa3ac → ca903d2
raheelshahzad left a comment:
- Exponential backoff with configurable base and max intervals.
- Configurable `max_retries`.
- `retry_to_same_provider` option.
- Random alternative selection when failing over to a different model.
- Documentation updates in the reference configuration.
- Comprehensive unit tests for all of the above.
Thanks a lot Raheel for continuing to make plano better. We are getting there. This may be a slightly better way to specify retries:

```yaml
model_providers:
  - model: openai/gpt-4o
    access_key: $OPENAI_API_KEY
    default: true
    retry_policy:
      num_retries: 2
      # retry_on: [429]           # default
      # back_off:
      #   base_interval: 25ms     # default
      #   max_interval: 250ms     # default (10x base)
      # failover:
      #   strategy: same_provider # default

  # Need more control
  - model: anthropic/claude-sonnet-4-0
    access_key: $ANTHROPIC_API_KEY
    retry_policy:
      num_retries: 3
      failover:
        strategy: any

  # Full control
  - model: openai/gpt-4o-mini
    access_key: $OPENAI_API_KEY
    retry_policy:
      num_retries: 2
      retry_on: [429, 503]
      back_off:
        base_interval: 100ms
        max_interval: 2000ms
      failover:
        providers:
          - anthropic/claude-sonnet-4-0

  # No retries (default; just omit retry_policy)
  - model: mistral/ministral-3b-latest
    access_key: $MISTRAL_API_KEY
```
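For what it's worth, a sketch of Rust types that could deserialize the proposed block; the use of `humantime_serde` for the `25ms`-style durations is an assumption, as are the exact type names:

```rust
use serde::Deserialize;
use std::time::Duration;

// Sketch of the proposed `retry_policy` schema.
#[derive(Debug, Deserialize)]
pub struct RetryPolicy {
    pub num_retries: u32,
    #[serde(default = "default_retry_on")]
    pub retry_on: Vec<u16>,
    pub back_off: Option<BackOff>,
    pub failover: Option<Failover>,
}

fn default_retry_on() -> Vec<u16> {
    vec![429]
}

#[derive(Debug, Deserialize)]
pub struct BackOff {
    #[serde(with = "humantime_serde")]
    pub base_interval: Duration,
    #[serde(with = "humantime_serde")]
    pub max_interval: Duration,
}

// Externally tagged enum: accepts either `strategy: ...` or `providers: [...]`.
#[derive(Debug, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum Failover {
    Strategy(FailoverStrategy),
    Providers(Vec<String>),
}

#[derive(Debug, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum FailoverStrategy {
    SameProvider,
    Any,
}
```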
I like this developer experience and would love to see an updated PR about it. This would help with free-tier GPU traffic shaping and would be a very useful feature for coding agents.
Force-pushed ca903d2 → 1384982
Force-pushed 1384982 → d569d4f
Implement a retry-on-ratelimit system for the Plano gateway that automatically retries failed LLM requests (429, 503, timeouts) across alternative providers with intelligent provider selection.

Core modules (crates/common/src/retry/):
- orchestrator: retry loop with budget tracking and attempt management
- provider_selector: weighted selection excluding blocked providers
- error_detector: classifies responses into retryable error categories
- backoff: exponential backoff with jitter and Retry-After support
- retry_after_state: per-provider rate-limit cooldown tracking
- latency_block_state: high-latency provider temporary exclusion
- latency_trigger: consecutive slow-response counter
- validation: configuration validation with cross-field checks
- error_response: structured error responses when retries exhausted

Three phases: P0 (core retry + backoff), P1 (Retry-After + fallback models + timeout), P2 (proactive high-latency failover). Tests follow in a separate PR.
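As a rough illustration of the kind of classification the `error_detector` module describes (names are illustrative, not the module's actual API):

```rust
// Map an upstream outcome to a retry decision.
#[derive(Debug, PartialEq)]
enum RetryClass {
    RateLimited,  // 429: honor Retry-After, back off, maybe fail over
    Unavailable,  // 503: back off and fail over
    Timeout,      // no response within the deadline
    NotRetryable, // everything else passes through to the client
}

fn classify(status: Option<u16>, timed_out: bool) -> RetryClass {
    match (status, timed_out) {
        (_, true) => RetryClass::Timeout,
        (Some(429), _) => RetryClass::RateLimited,
        (Some(503), _) => RetryClass::Unavailable,
        _ => RetryClass::NotRetryable,
    }
}
```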
…elimit

Add 302 property-based unit tests (proptest, 100+ iterations each) and 13 integration test scenarios covering all retry behaviors.

Unit tests cover:
- Configuration round-trip parsing, defaults, and validation
- Status code range expansion and error classification
- Exponential backoff formula, bounds, and scope filtering
- Provider selection strategy correctness and fallback ordering
- Retry-After state scope behavior and max expiration updates
- Cooldown exclusion invariants and initial selection cooldown
- Bounded retry (max_attempts + budget enforcement)
- Request preservation across retries
- Latency trigger sliding window and block state management
- Timeout vs high-latency precedence
- Error response detail completeness

Integration tests (tests/e2e/):
- IT-1 through IT-13 covering 429/503 retry, exhaustion, backoff, fallback priority, Retry-After honoring, timeout retry, high-latency failover, streaming preservation, and body preservation
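For flavor, a sketch of the proptest style this commit describes, reusing the hypothetical `backoff_delay` helper sketched earlier in this thread:

```rust
use proptest::prelude::*;
use std::time::Duration;

proptest! {
    // Property: for any attempt number, the computed backoff never
    // exceeds the configured maximum interval.
    #[test]
    fn backoff_never_exceeds_max(attempt in 0u32..16) {
        let base = Duration::from_millis(25);
        let max = Duration::from_millis(250);
        let d = backoff_delay(attempt, base, max);
        prop_assert!(d <= max);
    }
}
```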
Force-pushed d569d4f → 98bf024
Question about Phase 2 (Provider_List fallback) behavior: when failover selects an alternative from the fallback list, the request can land on a much less capable model. Consider this scenario: a user configures a strong primary model with a small, cheap model in its fallback list. The result is a request that was routed for complex code generation silently being served by a model that is not capable of fulfilling it, wasting both time and tokens with no useful output. Would it make sense to add an option to opt out of (or constrain) capability-downgrading fallback?
I noticed this PR hasn't been updated for a while. Do you still have time to continue development? If you're currently unavailable, would you be willing to let me continue development based on your previous work? I will retain your signature and author information.
@TroyMitchell911 yes, would love to see you pick it up. It would be nice if you could split this work into smaller PRs. Please look at the issue and let me know if you have any questions.
Summary
Adds a retry-on-ratelimit system to the Plano gateway that automatically retries failed LLM requests (429, 503, timeouts) across alternative providers with intelligent selection.
Structure (2 commits)
Commit 1 — Production code (~4k lines)
Core retry engine in crates/common/src/retry/:
- orchestrator: retry loop with budget tracking
- provider_selector: weighted selection excluding blocked providers
- error_detector: classifies responses into retryable categories
- backoff: exponential backoff with jitter + Retry-After support
- retry_after_state: per-provider rate-limit cooldown tracking
- latency_block_state: high-latency provider temporary exclusion
- latency_trigger: consecutive slow-response counter
- validation: config validation with cross-field checks
- error_response: structured error responses when retries exhausted

Three phases: P0 (core retry + backoff), P1 (Retry-After + fallback models + timeout), P2 (proactive high-latency failover).
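A sketch of the per-provider cooldown idea behind `retry_after_state`, assuming integer-seconds `Retry-After` values only (HTTP-date handling omitted); names are illustrative, not the module's actual API:

```rust
use std::time::{Duration, Instant};

// When a 429 carries `Retry-After: <seconds>`, exclude the provider from
// selection until the cooldown expires.
#[derive(Default)]
struct CooldownState {
    until: Option<Instant>,
}

impl CooldownState {
    fn record_retry_after(&mut self, header: &str) {
        if let Ok(secs) = header.trim().parse::<u64>() {
            let candidate = Instant::now() + Duration::from_secs(secs);
            // Keep the latest expiration if several 429s overlap.
            self.until = Some(self.until.map_or(candidate, |u| u.max(candidate)));
        }
    }

    fn is_cooling_down(&self) -> bool {
        self.until.is_some_and(|u| Instant::now() < u)
    }
}
```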
Commit 2 — Tests (~10.9k lines)
- 302 property-based unit tests (proptest, 100+ iterations each)
- 13 integration test scenarios (tests/e2e/, IT-1 through IT-13)