test: add retry/failover e2e and unit tests by TroyMitchell911 · Pull Request #930 · katanemo/plano

TroyMitchell911 · 2026-04-28T13:17:09Z

Summary

Continuation of #733. Adds comprehensive tests for the retry/failover feature implemented in #926–#929.

PR Chain

This is PR 5 of 5 (final) in the retry/failover feature series:

#926 (feat(common): add retry policy configuration types): Configuration types
#927 (feat(common): add retry/failover engine): Retry/failover engine
#928 (feat(brightstaff): integrate retry orchestrator into LLM handler): Handler integration
#929 (feat: multi-provider failover support): Multi-provider failover support
This PR: E2E and unit tests (includes all commits from #926–#929; merge those first)

What This PR Adds

E2E Test Configs (13 scenarios)

Each test config uses mock upstream servers to simulate specific failure modes:

| Test | Scenario | Config |
|------|----------|--------|
| it1 | Basic 429 retry with `different_provider` strategy | `retry_it1_basic_429.yaml` |
| it2 | 503 failover to different provider | `retry_it2_503_different_provider.yaml` |
| it3 | All providers exhausted → error response | `retry_it3_all_exhausted.yaml` |
| it4 | No `retry_policy` → no retry (passthrough) | `retry_it4_no_retry_policy.yaml` |
| it5 | `max_attempts` limit respected | `retry_it5_max_attempts.yaml` |
| it6 | Backoff delay between retries | `retry_it6_backoff_delay.yaml` |
| it7 | `fallback_models` priority order | `retry_it7_fallback_priority.yaml` |
| it8 | `Retry-After` header honored | `retry_it8_retry_after_honored.yaml` |
| it9 | `Retry-After` blocks provider selection | `retry_it9_retry_after_blocks_selection.yaml` |
| it10 | Timeout triggers retry | `retry_it10_timeout_triggers_retry.yaml` |
| it11 | High latency failover | `retry_it11_high_latency_failover.yaml` |
| it12 | Streaming response retry | `retry_it12_streaming.yaml` |
| it13 | Request body preserved across retries | `retry_it13_body_preserved.yaml` |

Example Test Config Format

model_providers:
  - model: openai/gpt-4o
    base_url: http://host.docker.internal:${MOCK_PRIMARY_PORT}
    access_key: test-key-primary
    default: true
    retry_policy:
      default_strategy: "different_provider"
      default_max_attempts: 2
      on_status_codes:
        - codes: [429]
          strategy: "different_provider"
          max_attempts: 2

  - model: anthropic/claude-3-5-sonnet
    base_url: http://host.docker.internal:${MOCK_SECONDARY_PORT}
    access_key: test-key-secondary
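
To make the relationship between the defaults and the `on_status_codes` entries concrete, here is a minimal Python sketch of how a per-status entry might override `default_strategy` and `default_max_attempts`. It is an illustration only: `resolve_retry_settings` and the `POLICY` dict are hypothetical names, the override values differ from the config above on purpose so the effect is visible, and the actual resolution logic lives in the Rust crates from #926–#927 and may behave differently.

```python
from typing import Any, Dict, Tuple

# Plain-dict stand-in for a retry_policy block (values are illustrative,
# not copied from any of the 13 test configs).
POLICY: Dict[str, Any] = {
    "default_strategy": "different_provider",
    "default_max_attempts": 2,
    "on_status_codes": [
        {"codes": [429], "strategy": "same_provider", "max_attempts": 3},
    ],
}


def resolve_retry_settings(policy: Dict[str, Any], status: int) -> Tuple[str, int]:
    """Return (strategy, max_attempts) for a status code.

    Assumption: the first on_status_codes entry listing the status wins,
    and any field it omits falls back to the policy-wide default.
    """
    for entry in policy.get("on_status_codes", []):
        if status in entry["codes"]:
            strategy = entry.get("strategy", policy["default_strategy"])
            attempts = entry.get("max_attempts", policy["default_max_attempts"])
            return strategy, attempts
    return policy["default_strategy"], policy["default_max_attempts"]


print(resolve_retry_settings(POLICY, 429))
print(resolve_retry_settings(POLICY, 503))
```

Whether a given status is retryable at all is a separate decision; this sketch only covers which settings would apply once a retry is attempted.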

Unit Tests

Failover provider exploration order verification
State preservation across retry attempts

Changes

tests/e2e/configs/retry_it*.yaml: 13 E2E test configurations
tests/e2e/retry_failover_test.rs: Integration test suite
crates/common/src/retry/tests.rs: Unit tests for failover logic

Signed-off-by: Troy Mitchell <troymitchell988@gmail.com>

Signed-off-by: Troy Mitchell <i@troy-y.org>

Add retry policy configuration types to support automatic retry and failover for LLM requests:
- RetryPolicy: top-level config with fallback_models, default_strategy, default_max_attempts, and per-status-code overrides
- BackoffConfig: exponential backoff with base_ms, max_ms, jitter, and scope (per-model, per-provider, or global)
- RetryAfterConfig: Retry-After header handling with block scope and duration limits
- HighLatencyConfig: latency-based blocking with threshold, measurement type, and trigger conditions
- LatencyTriggerConfig: min_triggers and trigger_window for debouncing
- RetryStrategy enum: same_model, same_provider, different_provider
- StatusCodeEntry: flexible status code matching (single, range, list)
Also add retry_policy field to GatewayConfig with Default impl.
Signed-off-by: Troy Mitchell <i@troy-y.org>

Add comprehensive tests for retry policy configuration:
- proptest: round-trip serialization, default invariants, status code expansion (single, range, full range)
- YAML pattern tests covering 17 real-world configuration patterns: multi-provider failover, same-provider model downgrade, backoff on multiple error types, per-status-code strategy customization, timeout-specific config, no-retry, backoff scopes (model/provider/global), high-latency blocking, retry-after handling, fallback models list, mixed integer and range codes
Signed-off-by: Troy Mitchell <i@troy-y.org>
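
The status code expansion cases named above (single, range, full range) can be pictured with a short Python sketch. The on-disk representation is an assumption here: integers for single codes and `"lo-hi"` strings for ranges. The real `StatusCodeEntry` type in #926 may encode ranges differently, and `expand_status_codes` is a hypothetical helper name.

```python
from typing import Iterable, Set, Union


def expand_status_codes(entries: Iterable[Union[int, str]]) -> Set[int]:
    """Expand a mixed list of single codes and "lo-hi" ranges into a set.

    Assumed input format (not confirmed by the source): ints such as 429
    and inclusive range strings such as "500-504".
    """
    codes: Set[int] = set()
    for entry in entries:
        if isinstance(entry, int):
            lo = hi = entry
        else:
            lo_text, hi_text = entry.split("-", 1)
            lo, hi = int(lo_text), int(hi_text)
        # Reject anything outside the valid HTTP status range or reversed.
        if not (100 <= lo <= hi <= 599):
            raise ValueError(f"invalid status code entry: {entry!r}")
        codes.update(range(lo, hi + 1))
    return codes


print(sorted(expand_status_codes([429, "500-504"])))
```

A property test in the style the commit describes would then assert, for example, that every expanded code lies in 100..599 and that a single code expands to a one-element set.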

Add JSON schema definitions for retry policy configuration including RetryPolicy, BackoffConfig, RetryAfterConfig, HighLatencyConfig, LatencyTriggerConfig, RetryStrategy, StatusCodeEntry, and all associated enums.
Signed-off-by: Troy Mitchell <i@troy-y.org>

Signed-off-by: Troy Mitchell <i@troy-y.org>

Add the retry module with core type definitions including:
- RequestContext, RequestSignature for request deduplication
- RetryExhaustedError, AllProvidersExhaustedError for error handling
- AttemptError, AttemptErrorType for attempt tracking
- ValidationError, ValidationWarning for config validation
- Helper functions for provider extraction and hashing
Wire up pub mod retry in lib.rs.
Signed-off-by: Troy Mitchell <i@troy-y.org>

Implement BackoffCalculator supporting:
- Exponential backoff with configurable base/max delay
- Full, equal, and decorrelated jitter strategies
- Per-provider and per-status-code backoff overrides
- Comprehensive unit tests for all strategies
Signed-off-by: Troy Mitchell <i@troy-y.org>
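
For readers unfamiliar with the three jitter strategies, the following Python sketch shows their commonly used formulas. It is not the `BackoffCalculator` implementation: the function name, the zero-based `attempt` convention, and the `"none"` mode (added only to give a deterministic reference value) are assumptions made for illustration.

```python
import random
from typing import Optional


def backoff_delay_ms(attempt: int, base_ms: float, max_ms: float,
                     jitter: str = "full", prev_ms: Optional[float] = None,
                     rng: Optional[random.Random] = None) -> float:
    """Compute a retry delay in milliseconds.

    attempt is zero-based (an assumption): attempt 0 waits about base_ms.
    """
    rng = rng or random
    # Exponential growth, capped at max_ms.
    capped = min(max_ms, base_ms * (2 ** attempt))
    if jitter == "none":  # hypothetical mode, not named in the source
        return float(capped)
    if jitter == "full":
        # Anywhere between 0 and the capped exponential delay.
        return rng.uniform(0, capped)
    if jitter == "equal":
        # Half fixed, half random.
        return capped / 2 + rng.uniform(0, capped / 2)
    if jitter == "decorrelated":
        # Depends on the previous delay rather than the attempt number.
        prev = base_ms if prev_ms is None else prev_ms
        return min(float(max_ms), rng.uniform(base_ms, prev * 3))
    raise ValueError(f"unknown jitter strategy: {jitter}")


print(backoff_delay_ms(3, base_ms=100, max_ms=500, jitter="none"))
```

Passing a seeded `random.Random` as `rng` keeps timing assertions reproducible, which is one way a backoff-delay test such as it6 could avoid flakiness.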

Implement ErrorDetector that classifies HTTP responses into:
- Retryable errors (5xx, 429, timeouts)
- Non-retryable errors (4xx client errors)
- Successful responses
Supports configurable status code matching and latency-based error detection with measurement strategies (TTFB/total).
Signed-off-by: Troy Mitchell <i@troy-y.org>
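
The three-way classification described above can be sketched in a few lines of Python. This is a simplified stand-in, not the `ErrorDetector` API: `Outcome` and `classify` are hypothetical names, and treating a slow but successful response as retryable is an assumption about how latency-based detection feeds into failover.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SUCCESS = "success"
    RETRYABLE = "retryable"
    NON_RETRYABLE = "non_retryable"


def classify(status: Optional[int], timed_out: bool = False,
             latency_ms: Optional[float] = None,
             latency_threshold_ms: Optional[float] = None) -> Outcome:
    """Classify one upstream attempt.

    status is None when no response arrived (for example, a timeout).
    """
    if timed_out or status is None:
        return Outcome.RETRYABLE
    if status == 429 or 500 <= status <= 599:
        return Outcome.RETRYABLE
    if 200 <= status <= 299:
        # Assumption: a response slower than the threshold counts as retryable.
        too_slow = (latency_ms is not None
                    and latency_threshold_ms is not None
                    and latency_ms > latency_threshold_ms)
        return Outcome.RETRYABLE if too_slow else Outcome.SUCCESS
    # Remaining statuses (other 4xx client errors, and so on) are not retried.
    return Outcome.NON_RETRYABLE


print(classify(429), classify(400), classify(200))
```

The `latency_ms` argument would carry either time to first byte or total response time, depending on which measurement strategy is configured.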

Implement RetryErrorResponseBuilder that constructs structured JSON error responses when all retry attempts are exhausted, including per-attempt error details and provider information.
Signed-off-by: Troy Mitchell <i@troy-y.org>

Add three state management components:
- LatencyBlockStateManager: tracks providers blocked due to high latency with configurable block duration and scope
- LatencyTriggerCounter: counts consecutive latency threshold breaches before triggering provider blocking
- RetryAfterStateManager: honors Retry-After headers with per-provider/model/endpoint blocking scope
Signed-off-by: Troy Mitchell <i@troy-y.org>
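
A rough Python sketch of two of these ideas follows: blocking a key until a Retry-After deadline passes, and requiring several latency breaches inside a time window before triggering a block. Both classes are hypothetical simplifications. In particular, the counter below uses a sliding window built from `min_triggers` and `trigger_window`, which is one plausible reading of the configuration fields and may not match the "consecutive breaches" semantics of the real `LatencyTriggerCounter`. Time is passed in explicitly so the behavior is easy to test.

```python
from collections import deque
from typing import Deque, Dict


class RetryAfterBlocks:
    """Tracks until when each key (provider, model, or endpoint) is blocked."""

    def __init__(self, max_block_s: float = 300.0) -> None:
        self.max_block_s = max_block_s  # assumed cap on Retry-After values
        self._until: Dict[str, float] = {}

    def block(self, key: str, retry_after_s: float, now: float) -> None:
        duration = min(max(retry_after_s, 0.0), self.max_block_s)
        # Never shorten an existing block.
        self._until[key] = max(self._until.get(key, 0.0), now + duration)

    def is_blocked(self, key: str, now: float) -> bool:
        return now < self._until.get(key, 0.0)


class LatencyTriggerCounter:
    """Fires once min_triggers breaches fall inside trigger_window_s."""

    def __init__(self, min_triggers: int, trigger_window_s: float) -> None:
        self.min_triggers = min_triggers
        self.trigger_window_s = trigger_window_s
        self._events: Dict[str, Deque[float]] = {}

    def record(self, key: str, now: float) -> bool:
        events = self._events.setdefault(key, deque())
        events.append(now)
        # Drop breaches that are older than the window.
        while events and now - events[0] > self.trigger_window_s:
            events.popleft()
        return len(events) >= self.min_triggers


blocks = RetryAfterBlocks(max_block_s=60)
blocks.block("openai", retry_after_s=120, now=0.0)
print(blocks.is_blocked("openai", now=59.0), blocks.is_blocked("openai", now=60.0))
```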

Implement validate_retry_policy() that checks retry policy configuration for errors and warnings including:
- Invalid max_retries/timeout ranges
- Conflicting backoff and jitter settings
- Missing or invalid provider references
- Latency threshold consistency checks
Signed-off-by: Troy Mitchell <i@troy-y.org>

Implement ProviderSelector that determines the next provider for retry attempts based on:
- Failover provider list with priority ordering
- Latency-blocked provider filtering
- Retry-After header honoring
- Round-robin and priority-based selection strategies
Signed-off-by: Troy Mitchell <i@troy-y.org>
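
As an illustration of priority-based selection with blocked-provider filtering, here is a minimal Python sketch. `select_next_provider` is a hypothetical function, not the `ProviderSelector` API, and it covers only the priority-ordered strategy (round-robin is not shown). The third provider name is made up for the example.

```python
from typing import Callable, Iterable, Optional, Set


def select_next_provider(candidates: Iterable[str], tried: Set[str],
                         is_blocked: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate, in priority order, that has not been
    tried for this request and is not currently blocked; None if none remain.
    """
    for provider in candidates:
        if provider in tried or is_blocked(provider):
            continue
        return provider
    return None


# Priority order as it might come from fallback_models (names illustrative).
candidates = ["openai/gpt-4o", "anthropic/claude-3-5-sonnet", "backup/model-c"]
blocked = {"anthropic/claude-3-5-sonnet"}  # e.g. blocked by Retry-After
print(select_next_provider(candidates, {"openai/gpt-4o"}, blocked.__contains__))
```

A `None` result corresponds to the "all providers exhausted" outcome exercised by it3.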

Implement RetryOrchestrator as the top-level coordinator that:
- Manages the full retry lifecycle per request
- Integrates backoff, error detection, provider selection
- Handles request deduplication via content hashing
- Supports both same-provider retry and cross-provider failover
- Emits structured attempt records for observability
Signed-off-by: Troy Mitchell <i@troy-y.org>
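
The overall control flow can be sketched as a small loop in Python. This is a deliberately reduced model and not the `RetryOrchestrator` itself: it omits backoff, deduplication, and same-provider retries, the `RETRYABLE` set is a fixed assumption, the result shape is invented for the example, and it assumes `max_attempts` counts the initial attempt.

```python
from typing import Any, Callable, Dict, List

# Assumed retryable statuses for the sketch; the real set is configurable.
RETRYABLE = {429, 500, 502, 503, 504}


def run_with_failover(providers: List[str], forward: Callable[[str], int],
                      max_attempts: int) -> Dict[str, Any]:
    """Try providers in order until one returns a non-retryable status.

    forward(provider) performs one upstream call and returns its HTTP status.
    Every attempt is recorded so an exhaustion error can list the details.
    """
    attempts: List[Dict[str, Any]] = []
    for provider in providers:
        if len(attempts) >= max_attempts:
            break
        status = forward(provider)
        attempts.append({"provider": provider, "status": status})
        if status not in RETRYABLE:
            return {"status": status, "provider": provider, "attempts": attempts}
    return {"error": "all_providers_exhausted", "attempts": attempts}


# Fake upstreams: the primary is rate limited, the secondary succeeds.
responses = {"openai/gpt-4o": 429, "anthropic/claude-3-5-sonnet": 200}
result = run_with_failover(list(responses), responses.__getitem__, max_attempts=2)
print(result["provider"], len(result["attempts"]))
```

Replacing `forward` with a function that calls a mock HTTP server gives the general shape of scenarios such as it1 and it3.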

Signed-off-by: Troy Mitchell <i@troy-y.org>

Avoid re-serializing request bodies when unnecessary to maintain JSON key order, whitespace, and unknown fields, which is critical for prompt cache prefix matching on providers like Anthropic:
- routing_service: only re-serialize when routing_preferences were actually removed from the body
- stream_context: replace model name at byte level instead of full deserialization/re-serialization cycle
- Strip provider prefix from model name (e.g. 'custom-aws/claude-opus-4-6' -> 'claude-opus-4-6') before sending upstream
Signed-off-by: Troy Mitchell <i@troy-y.org>
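
To show why a byte-level replacement preserves the rest of the body, here is a small Python sketch. It is not the `stream_context` code: the helper names are hypothetical, and the regex approach assumes the model name is plain ASCII with no JSON escapes (a body that encodes the slash as `\/` would not match) and that the first `"model"` key in the text is the top-level one.

```python
import re


def strip_provider_prefix(model: str) -> str:
    """'custom-aws/claude-opus-4-6' -> 'claude-opus-4-6' (no-op without a slash)."""
    return model.split("/", 1)[1] if "/" in model else model


def replace_model_bytes(body: bytes, old: str, new: str) -> bytes:
    """Swap the value of the first "model" key without re-serializing.

    Everything outside the matched value (key order, whitespace,
    unknown fields) is left byte-for-byte identical.
    """
    pattern = re.compile(rb'("model"\s*:\s*)"' + re.escape(old.encode()) + rb'"')
    # A function replacement avoids backslash handling in the new value.
    return pattern.sub(lambda m: m.group(1) + b'"' + new.encode() + b'"',
                       body, count=1)


body = b'{"z": 1,  "model": "custom-aws/claude-opus-4-6", "a":[1,2]}'
old = "custom-aws/claude-opus-4-6"
print(replace_model_bytes(body, old, strip_provider_prefix(old)))
```

A `json.loads` followed by `json.dumps` would normalize the double space and the compact `[1,2]` array, which is exactly the kind of change that could break prefix matching in a prompt cache.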

Wire up the retry module into the brightstaff LLM handler:
- Add send_upstream_with_retry() that uses RetryOrchestrator to coordinate retry attempts with backoff and failover
- Build forward_fn closure for per-attempt HTTP calls
- Support failover to alternative providers on retryable errors
- Fall back to single-attempt send_upstream() when no retry policy is configured
Signed-off-by: Troy Mitchell <i@troy-y.org>

Remove model_id-level deduplication that prevented different providers from serving the same model (e.g., custom/claude-opus-4-6 and custom-aws/claude-opus-4-6). Full model_name dedup is already handled upstream.
Also fix model field to use full provider/model format instead of stripped model_id, and adjust router model lookup accordingly.
Signed-off-by: Troy Mitchell <i@troy-y.org>

Clone request_path into the retry closure so that failover requests to alternate providers preserve the original request path.
Signed-off-by: Troy Mitchell <i@troy-y.org>

Provide a minimal example configuration demonstrating retry and multi-provider failover setup with OpenAI and Anthropic providers.
Signed-off-by: Troy Mitchell <i@troy-y.org>

Add 13 test-specific plano configs covering:
- Basic 429/503 retry scenarios
- Multi-provider failover with priority ordering
- Max attempts and backoff delay verification
- Retry-After header honoring and blocking
- Timeout-triggered retry
- High latency failover
- Streaming retry
- Request body preservation across retries
Signed-off-by: Troy Mitchell <i@troy-y.org>

Comprehensive pytest-based e2e test suite with mock HTTP servers simulating provider failures (429, 503, timeouts, high latency). Tests verify retry orchestration, provider selection, backoff timing, Retry-After header handling, streaming retry, and request body preservation across retry attempts.
Signed-off-by: Troy Mitchell <i@troy-y.org>

- test_failover_exploration.py: verify provider selection logic, fallback ordering, and error-triggered failover behavior
- test_failover_preservation.py: verify request context (headers, body, path) is preserved across failover attempts
Signed-off-by: Troy Mitchell <i@troy-y.org>

TroyMitchell911 added 22 commits April 28, 2026 15:20

common: add proptest dev-dependency for configuration tests

2548aa7

Signed-off-by: Troy Mitchell <i@troy-y.org>

common: add sha2, dashmap, tokio runtime dependencies for retry module

6853e4d

Signed-off-by: Troy Mitchell <i@troy-y.org>

retry: add error response builder for retry exhaustion

47a3e8a

retry: update Cargo.lock for retry module dependencies

d29ed70

Signed-off-by: Troy Mitchell <i@troy-y.org>

llm handler: capture request_path for failover forwarding

7fecea1

add example plano_config.yaml with retry/failover settings

4da0c63

TroyMitchell911 wants to merge 22 commits into katanemo:main from TroyMitchell911:feat/retry-e2e-tests
