
feat: automatic retry and failover for rate-limited LLM requests #733

Open
raheelshahzad wants to merge 2 commits into katanemo:main from raheelshahzad:feat/retry-on-ratelimit

Conversation

@raheelshahzad (Collaborator) commented Feb 10, 2026

Summary

Adds a retry-on-ratelimit system to the Plano gateway that automatically retries failed LLM requests (429, 503, timeouts) across alternative providers with intelligent selection.

Structure (2 commits)

Commit 1 — Production code (~4k lines)
Core retry engine in crates/common/src/retry/:

  • orchestrator: retry loop with budget tracking
  • provider_selector: weighted selection excluding blocked providers
  • error_detector: classifies responses into retryable categories
  • backoff: exponential backoff with jitter + Retry-After support
  • retry_after_state: per-provider rate-limit cooldown tracking
  • latency_block_state: high-latency provider temporary exclusion
  • latency_trigger: consecutive slow-response counter
  • validation: config validation with cross-field checks
  • error_response: structured error responses when retries exhausted

Three phases: P0 (core retry + backoff), P1 (Retry-After + fallback models + timeout), P2 (proactive high-latency failover).
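As a rough illustration of the backoff rule the modules above describe, the sketch below combines exponential growth with a cap and an upstream Retry-After hint. This is a hypothetical Rust sketch under assumed defaults (25ms base, 250ms cap), with jitter omitted for determinism; `backoff_delay` is not a function from the PR.

```rust
use std::time::Duration;

// Hypothetical sketch of the backoff rule: exponential growth from
// base_interval, capped at max_interval, with an upstream Retry-After
// hint taking precedence when present.
fn backoff_delay(
    attempt: u32,                  // 0-based retry attempt
    base: Duration,                // e.g. 25ms
    max: Duration,                 // e.g. 250ms
    retry_after: Option<Duration>, // parsed Retry-After header, if any
) -> Duration {
    if let Some(hint) = retry_after {
        // Honor the provider's hint, still bounded by max_interval.
        return hint.min(max);
    }
    // base * 2^attempt, saturating instead of overflowing.
    base.saturating_mul(2u32.saturating_pow(attempt)).min(max)
}

fn main() {
    let (base, max) = (Duration::from_millis(25), Duration::from_millis(250));
    assert_eq!(backoff_delay(0, base, max, None), Duration::from_millis(25));
    assert_eq!(backoff_delay(3, base, max, None), Duration::from_millis(200));
    // 25ms * 2^4 = 400ms, clamped to the 250ms cap.
    assert_eq!(backoff_delay(4, base, max, None), max);
    // A Retry-After hint overrides the exponential schedule.
    assert_eq!(
        backoff_delay(1, base, max, Some(Duration::from_millis(40))),
        Duration::from_millis(40)
    );
}
```

A production version would add jitter on top of the computed delay; the clamping and hint-precedence shape stays the same.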

Commit 2 — Tests (~10.9k lines)

  • 302 property-based unit tests (proptest, 100+ iterations each)
  • 13 integration test scenarios (IT-1 through IT-13)
  • Covers all retry behaviors: 429/503, exhaustion, backoff, fallback priority, Retry-After, timeout, high-latency failover, streaming, body preservation

@adilhafeez (Contributor) left a comment:

Thanks a lot for putting this change together, @raheelshahzad. Please join our Discord channel too. Overall looks good!

I left some comments in the PR, and have some additional suggestions on the overall change:

  • we should do exponential backoff on retries
  • how do we ensure that we have not hit the request timeout?
  • max_retries should be defined somewhere in config.yaml; probably not in this PR, but we should let developers set that value
  • this code change needs an update to the docs
  • I think we should allow retry to the same provider, or at least let developers choose whether to retry on a different provider. Consider the following example:
```yaml
model_providers:
  - model: openai/gpt-4o
    base_url: https://dsna-oai.openai.azure.com
    access_key: $OPENAI_API_KEY
    retry_on_ratelimit: true      # new feature
    retry_to_same_provider: true  # only retry to the same provider; otherwise retry randomly across all models

  - model: openai/gpt-5
    base_url: https://dsna-oai.openai.azure.com
    access_key: $OPENAI_API_KEY
```

Comment thread crates/common/src/llm_providers.rs Outdated
Comment on lines +95 to +104
```rust
self.providers.iter().find_map(|(key, provider)| {
    if provider.internal != Some(true)
        && provider.name != current_name
        && key == &provider.name
    {
        Some(Arc::clone(provider))
    } else {
        None
    }
})
```

should pick random model

Comment thread crates/brightstaff/src/handlers/llm.rs Outdated
Comment on lines 403 to 419
```rust
if res.status() == StatusCode::TOO_MANY_REQUESTS && attempts < max_attempts {
    let providers = llm_providers.read().await;
    if let Some(provider) = providers.get(&current_resolved_model) {
        if provider.retry_on_ratelimit == Some(true) {
            if let Some(alt_provider) = providers.get_alternative(&current_resolved_model) {
                info!(
                    request_id = %request_id,
                    current_model = %current_resolved_model,
                    alt_model = %alt_provider.name,
                    "429 received, retrying with alternative model"
                );
                current_resolved_model = alt_provider.name.clone();
                continue;
            }
        }
    }
}
```

we need to add exponential backoff

Comment thread crates/brightstaff/src/handlers/llm.rs Outdated
```rust
let mut current_resolved_model = resolved_model.clone();
let mut current_client_request = client_request;
let mut attempts = 0;
let max_attempts = 2; // Original + 1 retry
```

this should be configurable

```rust
);
// Capture start time right before sending request to upstream
let request_start_time = std::time::Instant::now();
let _request_start_system_time = std::time::SystemTime::now();
```

dead code?

@adilhafeez (Contributor) commented Feb 10, 2026

I looked through Envoy's retry semantics: https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/route/v3/route_components.proto#envoy-v3-api-field-config-route-v3-routeaction-retry-policy

I think we should lean toward this design for retries. We don't have to implement it completely, but we should implement a bare minimum that follows similar semantics/config. Thoughts?

@raheelshahzad force-pushed the feat/retry-on-ratelimit branch from d1aa3ac to ca903d2 (February 12, 2026, 04:08)
@raheelshahzad (Collaborator, Author) left a comment:

  1. Exponential backoff with configurable base and max intervals.
  2. Configurable max_retries.
  3. retry_to_same_provider option.
  4. Random alternative selection when failing over to a different model.
  5. Documentation updates in the reference configuration.
  6. Comprehensive unit tests for all the above.

@adilhafeez (Contributor) commented Feb 12, 2026

Thanks a lot, Raheel, for continuing to make Plano better. We are getting there.

This may be a slightly better way to specify retries:

```yaml
model_providers:
  - model: openai/gpt-4o
    access_key: $OPENAI_API_KEY
    default: true
    retry_policy:
      num_retries: 2
      # retry_on: [429]             # default
      # back_off:
      #   base_interval: 25ms       # default
      #   max_interval: 250ms       # default (10x base)
      # failover:
      #   strategy: same_provider   # default

  # Need more control
  - model: anthropic/claude-sonnet-4-0
    access_key: $ANTHROPIC_API_KEY
    retry_policy:
      num_retries: 3
      failover:
        strategy: any

  # Full control
  - model: openai/gpt-4o-mini
    access_key: $OPENAI_API_KEY
    retry_policy:
      num_retries: 2
      retry_on: [429, 503]
      back_off:
        base_interval: 100ms
        max_interval: 2000ms
      failover:
        providers:
          - anthropic/claude-sonnet-4-0

  # No retries (default; just omit retry_policy)
  - model: mistral/ministral-3b-latest
    access_key: $MISTRAL_API_KEY
```
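For illustration, the proposed `retry_policy` block could map onto a Rust shape along these lines. This struct is hypothetical (not code from the PR); the field names and defaults are taken from the YAML proposal above, including the commented-out default values.

```rust
use std::time::Duration;

// Hypothetical Rust-side shape of the proposed `retry_policy` block.
// Defaults mirror the commented values in the YAML proposal.
#[derive(Debug, Clone, PartialEq)]
struct RetryPolicy {
    num_retries: u32,
    retry_on: Vec<u16>,     // HTTP status codes that trigger a retry
    base_interval: Duration,
    max_interval: Duration,
    failover: FailoverStrategy,
}

#[derive(Debug, Clone, PartialEq)]
enum FailoverStrategy {
    SameProvider,            // default: retry the same provider
    Any,                     // retry any available provider
    Providers(Vec<String>),  // explicit fallback list
}

impl Default for RetryPolicy {
    fn default() -> Self {
        RetryPolicy {
            num_retries: 2,
            retry_on: vec![429],
            base_interval: Duration::from_millis(25),
            max_interval: Duration::from_millis(250), // 10x base
            failover: FailoverStrategy::SameProvider,
        }
    }
}

fn main() {
    let p = RetryPolicy::default();
    assert_eq!(p.retry_on, vec![429]);
    assert_eq!(p.max_interval, p.base_interval * 10);
    assert_eq!(p.failover, FailoverStrategy::SameProvider);
}
```

Omitting `retry_policy` entirely would then deserialize to "no retries", matching the last example in the proposal.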

@salmanap (Contributor) commented Mar 3, 2026

> Thanks a lot Raheel for continuing to make plano better. We are getting there. This may be a slightly better way to specify retries, … (full `retry_policy` proposal quoted from @adilhafeez above)

I like this developer experience and would love to see an updated PR for it. This would help with free-tier GPU traffic shaping, and it's a very useful feature for coding agents.

@raheelshahzad force-pushed the feat/retry-on-ratelimit branch from ca903d2 to 1384982 (March 9, 2026, 00:43)
@raheelshahzad raheelshahzad changed the title feat: add support for retrying LLM requests on 429 ratelimits (#697) feat: automatic retry and failover for rate-limited LLM requests Mar 9, 2026
@raheelshahzad force-pushed the feat/retry-on-ratelimit branch from 1384982 to d569d4f (March 9, 2026, 00:45)
Implement a retry-on-ratelimit system for the Plano gateway that
automatically retries failed LLM requests (429, 503, timeouts) across
alternative providers with intelligent provider selection.

Core modules (crates/common/src/retry/):
- orchestrator: retry loop with budget tracking and attempt management
- provider_selector: weighted selection excluding blocked providers
- error_detector: classifies responses into retryable error categories
- backoff: exponential backoff with jitter and Retry-After support
- retry_after_state: per-provider rate-limit cooldown tracking
- latency_block_state: high-latency provider temporary exclusion
- latency_trigger: consecutive slow-response counter
- validation: configuration validation with cross-field checks
- error_response: structured error responses when retries exhausted

Three phases: P0 (core retry + backoff), P1 (Retry-After + fallback
models + timeout), P2 (proactive high-latency failover).

Tests follow in a separate PR.
…elimit

Add 302 property-based unit tests (proptest, 100+ iterations each) and
13 integration test scenarios covering all retry behaviors.

Unit tests cover:
- Configuration round-trip parsing, defaults, and validation
- Status code range expansion and error classification
- Exponential backoff formula, bounds, and scope filtering
- Provider selection strategy correctness and fallback ordering
- Retry-After state scope behavior and max expiration updates
- Cooldown exclusion invariants and initial selection cooldown
- Bounded retry (max_attempts + budget enforcement)
- Request preservation across retries
- Latency trigger sliding window and block state management
- Timeout vs high-latency precedence
- Error response detail completeness

Integration tests (tests/e2e/):
- IT-1 through IT-13 covering 429/503 retry, exhaustion, backoff,
  fallback priority, Retry-After honoring, timeout retry, high-latency
  failover, streaming preservation, and body preservation
@raheelshahzad force-pushed the feat/retry-on-ratelimit branch from d569d4f to 98bf024 (March 9, 2026, 01:45)
@TroyMitchell911

Question about Phase 2 (Provider_List fallback) behavior:

When fallback_models is exhausted and the strategy is SameProvider or DifferentProvider, the selector falls through to Phase 2 which iterates over all_providers with only the strategy filter applied. There is no constraint tying the retry candidates back to the original routing intent.

Consider this scenario: a user configures opus-4.6-source-1 and opus-4.6-source-2 under the same provider for code generation tasks, with fallback_models: [opus-4.6-source-2] and strategy: different_provider. If both sources are down, Phase 2 would pick up any available model from the global Provider_List that matches the strategy filter — potentially a much weaker model like haiku that happens to share the same provider prefix.

The result is a request that was routed for complex code generation silently being served by a model that is not capable of fulfilling it, wasting both time and tokens with no useful output.

Would it make sense to add an option (e.g. fallback_scope: "fallback_list_only" | "all_providers", defaulting to fallback_list_only) so that operators can opt out of the Phase 2 global fallback? That way, when the explicit fallback chain is exhausted, the system returns an error immediately rather than degrading to an unsuitable model.
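For concreteness, the suggested knob might look like the fragment below. This is a hypothetical sketch: `fallback_scope` is the option proposed in this comment, not an existing setting, and the model names are the ones from the scenario above.

```yaml
model_providers:
  - model: openai/opus-4.6-source-1
    retry_policy:
      fallback_models: [opus-4.6-source-2]
      failover:
        strategy: different_provider
      # Proposed: when the explicit fallback chain is exhausted,
      # fail fast instead of falling through to the global Provider_List.
      fallback_scope: fallback_list_only   # or: all_providers
```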

@TroyMitchell911

I noticed this PR hasn't been updated for a while. Do you still have time to continue development? If you're currently unavailable, would you be willing to let me continue development based on your previous work? I will retain your signature and author information.

@adilhafeez (Contributor)

@TroyMitchell911 yes would love to see you pick it up. It would be nice if you can split this work into smaller PRs. Please look at the issue and let me know if you have any questions.

TroyMitchell911 commented Apr 28, 2026

> @TroyMitchell911 yes would love to see you pick it up. It would be nice if you can split this work into smaller PRs. Please look at the issue and let me know if you have any questions.

Ready for review: #926, #927, #928, #929, #930

4 participants