feat: automatic retry and failover for rate-limited LLM requests #733
raheelshahzad wants to merge 2 commits.
Conversation
Thanks a lot for putting this change together @raheelshahzad. Please join our Discord channel too. Overall this looks good!
I left some comments in the PR and have some additional suggestions/comments on the overall change:
- we should use exponential backoff on retries
- how do we ensure that we have not exceeded the request timeout? (see the budget-check sketch after the config example below)
- max_retries should be defined somewhere in config.yaml; probably not in this PR, but we should let developers define that variable
- this code change needs an update to the docs
- I think we should allow retries to the same provider, or at least let developers define whether they want to retry on a different provider. Consider the following example:
```yaml
model_providers:
  - model: openai/gpt-4o
    base_url: https://dsna-oai.openai.azure.com
    access_key: $OPENAI_API_KEY
    retry_on_ratelimit: true      # new feature
    retry_to_same_provider: true  # only allow retry to the same provider; otherwise retry randomly across all models
  - model: openai/gpt-5
    base_url: https://dsna-oai.openai.azure.com
    access_key: $OPENAI_API_KEY
```
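On the timeout point above, a minimal sketch of what a retry budget check could look like, assuming the gateway tracks a per-request deadline; the function and names are hypothetical, not part of this PR:

```rust
use std::time::{Duration, Instant};

// Hypothetical budget check: only retry if the next backoff sleep still
// fits inside the overall request timeout.
fn may_retry(started: Instant, request_timeout: Duration, next_backoff: Duration) -> bool {
    started.elapsed() + next_backoff < request_timeout
}
```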
```rust
self.providers.iter().find_map(|(key, provider)| {
    if provider.internal != Some(true)
        && provider.name != current_name
        && key == &provider.name
    {
        Some(Arc::clone(provider))
    } else {
        None
    }
})
```
should pick a random model
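A minimal sketch of that suggestion, assuming a `Provider` shape matching the diff above and the `rand` crate; names are illustrative, not the PR's actual API:

```rust
use std::collections::HashMap;
use std::sync::Arc;
use rand::seq::IteratorRandom;

struct Provider {
    name: String,
    internal: Option<bool>,
}

// Collect all eligible providers, then pick one uniformly at random
// instead of returning the first match.
fn pick_random_alternative(
    providers: &HashMap<String, Arc<Provider>>,
    current_name: &str,
) -> Option<Arc<Provider>> {
    providers
        .iter()
        .filter(|&(key, provider)| {
            provider.internal != Some(true)
                && provider.name != current_name
                && key == &provider.name
        })
        .map(|(_, provider)| Arc::clone(provider))
        .choose(&mut rand::thread_rng())
}
```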
```rust
if res.status() == StatusCode::TOO_MANY_REQUESTS && attempts < max_attempts {
    let providers = llm_providers.read().await;
    if let Some(provider) = providers.get(&current_resolved_model) {
        if provider.retry_on_ratelimit == Some(true) {
            if let Some(alt_provider) = providers.get_alternative(&current_resolved_model) {
                info!(
                    request_id = %request_id,
                    current_model = %current_resolved_model,
                    alt_model = %alt_provider.name,
                    "429 received, retrying with alternative model"
                );
                current_resolved_model = alt_provider.name.clone();
                continue;
            }
        }
    }
}
```
we need to add exponential backoff
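A minimal sketch of exponential backoff with full jitter (not the PR's implementation), using the 25ms base / 250ms cap defaults proposed later in this thread:

```rust
use rand::Rng;
use std::time::Duration;

// Exponential backoff with full jitter: the delay grows as base * 2^attempt,
// capped at `max`, and the actual sleep is uniformly random in [0, cap].
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let capped = base
        .saturating_mul(2u32.saturating_pow(attempt))
        .min(max);
    let jitter_ms = rand::thread_rng().gen_range(0..=capped.as_millis() as u64);
    Duration::from_millis(jitter_ms)
}
```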
```rust
let mut current_resolved_model = resolved_model.clone();
let mut current_client_request = client_request;
let mut attempts = 0;
let max_attempts = 2; // Original + 1 retry
```
this should be configurable
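For reference, a sketch of how the hard-coded value could become a config field with a serde default; the field names are hypothetical, not the PR's schema:

```rust
use serde::Deserialize;

// Sketch: per-provider retry settings, defaulting to the current
// hard-coded behavior (original request + 1 retry).
#[derive(Debug, Deserialize)]
pub struct RetrySettings {
    #[serde(default = "default_max_attempts")]
    pub max_attempts: u32,
}

fn default_max_attempts() -> u32 {
    2 // original request + 1 retry
}
```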
```rust
);
// Capture start time right before sending request to upstream
let request_start_time = std::time::Instant::now();
let _request_start_system_time = std::time::SystemTime::now();
```
I looked through Envoy's retry semantics (https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/route/v3/route_components.proto#envoy-v3-api-field-config-route-v3-routeaction-retry-policy) and I think we should lean toward this design for retries. We don't have to implement it completely, but we should implement a bare minimum following similar semantics/config. Thoughts?
Force-pushed d1aa3ac → ca903d2
raheelshahzad left a comment:
- Exponential backoff with configurable base and max intervals.
- Configurable `max_retries`.
- `retry_to_same_provider` option.
- Random alternative selection when failing over to a different model.
- Documentation updates in the reference configuration.
- Comprehensive unit tests for all of the above.
Thanks a lot Raheel for continuing to make plano better. We are getting there. This may be a slightly better way to specify retries:

```yaml
model_providers:
  - model: openai/gpt-4o
    access_key: $OPENAI_API_KEY
    default: true
    retry_policy:
      num_retries: 2
      # retry_on: [429]           # default
      # back_off:
      #   base_interval: 25ms     # default
      #   max_interval: 250ms     # default (10x base)
      # failover:
      #   strategy: same_provider # default

  # Need more control
  - model: anthropic/claude-sonnet-4-0
    access_key: $ANTHROPIC_API_KEY
    retry_policy:
      num_retries: 3
      failover:
        strategy: any

  # Full control
  - model: openai/gpt-4o-mini
    access_key: $OPENAI_API_KEY
    retry_policy:
      num_retries: 2
      retry_on: [429, 503]
      back_off:
        base_interval: 100ms
        max_interval: 2000ms
      failover:
        providers:
          - anthropic/claude-sonnet-4-0

  # No retries (default; just omit retry_policy)
  - model: mistral/ministral-3b-latest
    access_key: $MISTRAL_API_KEY
```
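For what it's worth, a sketch of Rust types that could deserialize the proposed block; the use of `humantime_serde` for the `25ms`-style durations is an assumption, as are the exact type names:

```rust
use serde::Deserialize;
use std::time::Duration;

// Sketch of the proposed `retry_policy` schema.
#[derive(Debug, Deserialize)]
pub struct RetryPolicy {
    pub num_retries: u32,
    #[serde(default = "default_retry_on")]
    pub retry_on: Vec<u16>,
    pub back_off: Option<BackOff>,
    pub failover: Option<Failover>,
}

fn default_retry_on() -> Vec<u16> {
    vec![429]
}

#[derive(Debug, Deserialize)]
pub struct BackOff {
    #[serde(with = "humantime_serde")]
    pub base_interval: Duration,
    #[serde(with = "humantime_serde")]
    pub max_interval: Duration,
}

// Externally tagged enum: accepts either `strategy: ...` or `providers: [...]`.
#[derive(Debug, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum Failover {
    Strategy(FailoverStrategy),
    Providers(Vec<String>),
}

#[derive(Debug, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum FailoverStrategy {
    SameProvider,
    Any,
}
```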
I like this developer experience and would love to see an updated PR about it. This would help with free-tier GPU traffic shaping and would be a very useful feature for coding agents.
Force-pushed ca903d2 → 1384982
Force-pushed 1384982 → d569d4f
Implement a retry-on-ratelimit system for the Plano gateway that automatically retries failed LLM requests (429, 503, timeouts) across alternative providers with intelligent provider selection.

Core modules (crates/common/src/retry/):
- orchestrator: retry loop with budget tracking and attempt management
- provider_selector: weighted selection excluding blocked providers
- error_detector: classifies responses into retryable error categories
- backoff: exponential backoff with jitter and Retry-After support
- retry_after_state: per-provider rate-limit cooldown tracking
- latency_block_state: high-latency provider temporary exclusion
- latency_trigger: consecutive slow-response counter
- validation: configuration validation with cross-field checks
- error_response: structured error responses when retries exhausted

Three phases: P0 (core retry + backoff), P1 (Retry-After + fallback models + timeout), P2 (proactive high-latency failover). Tests follow in a separate PR.
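As a rough illustration of the kind of classification the `error_detector` module describes (names are illustrative, not the module's actual API):

```rust
// Map an upstream outcome to a retry decision.
#[derive(Debug, PartialEq)]
enum RetryClass {
    RateLimited,  // 429: honor Retry-After, back off, maybe fail over
    Unavailable,  // 503: back off and fail over
    Timeout,      // no response within the deadline
    NotRetryable, // everything else passes through to the client
}

fn classify(status: Option<u16>, timed_out: bool) -> RetryClass {
    match (status, timed_out) {
        (_, true) => RetryClass::Timeout,
        (Some(429), _) => RetryClass::RateLimited,
        (Some(503), _) => RetryClass::Unavailable,
        _ => RetryClass::NotRetryable,
    }
}
```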
…elimit

Add 302 property-based unit tests (proptest, 100+ iterations each) and 13 integration test scenarios covering all retry behaviors.

Unit tests cover:
- Configuration round-trip parsing, defaults, and validation
- Status code range expansion and error classification
- Exponential backoff formula, bounds, and scope filtering
- Provider selection strategy correctness and fallback ordering
- Retry-After state scope behavior and max expiration updates
- Cooldown exclusion invariants and initial selection cooldown
- Bounded retry (max_attempts + budget enforcement)
- Request preservation across retries
- Latency trigger sliding window and block state management
- Timeout vs high-latency precedence
- Error response detail completeness

Integration tests (tests/e2e/):
- IT-1 through IT-13 covering 429/503 retry, exhaustion, backoff, fallback priority, Retry-After honoring, timeout retry, high-latency failover, streaming preservation, and body preservation
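For flavor, a sketch of the proptest style this commit describes, reusing the hypothetical `backoff_delay` helper sketched earlier in this thread:

```rust
use proptest::prelude::*;
use std::time::Duration;

proptest! {
    // Property: for any attempt number, the computed backoff never
    // exceeds the configured maximum interval.
    #[test]
    fn backoff_never_exceeds_max(attempt in 0u32..16) {
        let base = Duration::from_millis(25);
        let max = Duration::from_millis(250);
        let d = backoff_delay(attempt, base, max);
        prop_assert!(d <= max);
    }
}
```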
Force-pushed d569d4f → 98bf024
Question about Phase 2 (Provider_List fallback) behavior: when failover selects an alternative from the fallback list, the request can land on a much less capable model. Consider this scenario: a user configures a strong primary model with a small, cheap model in its fallback list. The result is a request that was routed for complex code generation silently being served by a model that is not capable of fulfilling it, wasting both time and tokens with no useful output. Would it make sense to add an option to opt out of (or constrain) capability-downgrading fallback?
I noticed this PR hasn't been updated for a while. Do you still have time to continue development? If you're currently unavailable, would you be willing to let me continue development based on your previous work? I will retain your signature and author information.
@TroyMitchell911 yes, would love to see you pick it up. It would be nice if you could split this work into smaller PRs. Please look at the issue and let me know if you have any questions.
Summary
Adds a retry-on-ratelimit system to the Plano gateway that automatically retries failed LLM requests (429, 503, timeouts) across alternative providers with intelligent selection.
Structure (2 commits)
Commit 1 — Production code (~4k lines)
Core retry engine in crates/common/src/retry/:
- orchestrator: retry loop with budget tracking
- provider_selector: weighted selection excluding blocked providers
- error_detector: classifies responses into retryable categories
- backoff: exponential backoff with jitter + Retry-After support
- retry_after_state: per-provider rate-limit cooldown tracking
- latency_block_state: high-latency provider temporary exclusion
- latency_trigger: consecutive slow-response counter
- validation: config validation with cross-field checks
- error_response: structured error responses when retries exhausted

Three phases: P0 (core retry + backoff), P1 (Retry-After + fallback models + timeout), P2 (proactive high-latency failover).
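A sketch of the per-provider cooldown idea behind `retry_after_state`, assuming integer-seconds `Retry-After` values only (HTTP-date handling omitted); names are illustrative, not the module's actual API:

```rust
use std::time::{Duration, Instant};

// When a 429 carries `Retry-After: <seconds>`, exclude the provider from
// selection until the cooldown expires.
#[derive(Default)]
struct CooldownState {
    until: Option<Instant>,
}

impl CooldownState {
    fn record_retry_after(&mut self, header: &str) {
        if let Ok(secs) = header.trim().parse::<u64>() {
            let candidate = Instant::now() + Duration::from_secs(secs);
            // Keep the latest expiration if several 429s overlap.
            self.until = Some(self.until.map_or(candidate, |u| u.max(candidate)));
        }
    }

    fn is_cooling_down(&self) -> bool {
        self.until.is_some_and(|u| Instant::now() < u)
    }
}
```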
Commit 2 — Tests (~10.9k lines)
- 302 property-based unit tests (proptest, 100+ iterations each)
- 13 integration test scenarios (tests/e2e/, IT-1 through IT-13)