test: add retry/failover e2e and unit tests #930

Open
TroyMitchell911 wants to merge 22 commits into katanemo:main from TroyMitchell911:feat/retry-e2e-tests
TroyMitchell911 commented Apr 28, 2026

Summary

Continuation of #733. Adds comprehensive tests for the retry/failover feature implemented in #926 through #929.

PR Chain

This is PR 5 of 5 (final) in the retry/failover feature series:

  1. feat(common): add retry policy configuration types #926 — Configuration types
  2. feat(common): add retry/failover engine #927 — Retry/failover engine
  3. feat(brightstaff): integrate retry orchestrator into LLM handler #928 — Handler integration
  4. feat: multi-provider failover support #929 — Multi-provider failover support
  5. This PR — E2E and unit tests (includes all commits from #926 through #929; merge those first)

What This PR Adds

E2E Test Configs (13 scenarios)

Each test config uses mock upstream servers to simulate specific failure modes:

| Test | Scenario | Config |
| --- | --- | --- |
| it1 | Basic 429 retry with different_provider strategy | retry_it1_basic_429.yaml |
| it2 | 503 failover to different provider | retry_it2_503_different_provider.yaml |
| it3 | All providers exhausted → error response | retry_it3_all_exhausted.yaml |
| it4 | No retry_policy → no retry (passthrough) | retry_it4_no_retry_policy.yaml |
| it5 | max_attempts limit respected | retry_it5_max_attempts.yaml |
| it6 | Backoff delay between retries | retry_it6_backoff_delay.yaml |
| it7 | fallback_models priority order | retry_it7_fallback_priority.yaml |
| it8 | Retry-After header honored | retry_it8_retry_after_honored.yaml |
| it9 | Retry-After blocks provider selection | retry_it9_retry_after_blocks_selection.yaml |
| it10 | Timeout triggers retry | retry_it10_timeout_triggers_retry.yaml |
| it11 | High latency failover | retry_it11_high_latency_failover.yaml |
| it12 | Streaming response retry | retry_it12_streaming.yaml |
| it13 | Request body preserved across retries | retry_it13_body_preserved.yaml |

Example Test Config Format

model_providers:
  - model: openai/gpt-4o
    base_url: http://host.docker.internal:${MOCK_PRIMARY_PORT}
    access_key: test-key-primary
    default: true
    retry_policy:
      default_strategy: "different_provider"
      default_max_attempts: 2
      on_status_codes:
        - codes: [429]
          strategy: "different_provider"
          max_attempts: 2

  - model: anthropic/claude-3-5-sonnet
    base_url: http://host.docker.internal:${MOCK_SECONDARY_PORT}
    access_key: test-key-secondary
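
For readers who want to see what a mock upstream behind `MOCK_PRIMARY_PORT` might look like, here is a minimal sketch using only the Python standard library. It is not the mock server from this PR: the function names, the `/v1/chat/completions` path, and the response bodies are all assumptions made for illustration. The server fails a fixed number of times with a chosen status and then returns 200.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_mock_upstream(fail_status, fail_count):
    """Start a mock upstream that fails `fail_count` times, then returns 200.

    Hypothetical helper: the real test suite may structure its mocks differently.
    """
    state = {"calls": 0}

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            self.rfile.read(length)  # drain the request body
            state["calls"] += 1
            if state["calls"] <= fail_count:
                status, body = fail_status, {"error": "simulated failure"}
            else:
                status, body = 200, {"ok": True}
            payload = json.dumps(body).encode()
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, *args):
            pass  # keep test output quiet

    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, state


def post(port):
    """POST an empty JSON object to the mock and return the HTTP status."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("POST", "/v1/chat/completions", body=b"{}",
                     headers={"Content-Type": "application/json"})
        resp = conn.getresponse()
        resp.read()
        return resp.status
    finally:
        conn.close()
```

Binding to port 0 lets the operating system choose a free port, which a test harness could then export as `MOCK_PRIMARY_PORT` before starting the gateway.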

Unit Tests

  • Failover provider exploration order verification
  • State preservation across retry attempts

Changes

  • tests/e2e/configs/retry_it*.yaml: 13 E2E test configurations
  • tests/e2e/retry_failover_test.rs: Integration test suite
  • crates/common/src/retry/tests.rs: Unit tests for failover logic

Signed-off-by: Troy Mitchell <troymitchell988@gmail.com>

Signed-off-by: Troy Mitchell <i@troy-y.org>
Add retry policy configuration types to support automatic retry and
failover for LLM requests:

- RetryPolicy: top-level config with fallback_models, default_strategy,
  default_max_attempts, and per-status-code overrides
- BackoffConfig: exponential backoff with base_ms, max_ms, jitter, and
  scope (per-model, per-provider, or global)
- RetryAfterConfig: Retry-After header handling with block scope and
  duration limits
- HighLatencyConfig: latency-based blocking with threshold, measurement
  type, and trigger conditions
- LatencyTriggerConfig: min_triggers and trigger_window for debouncing
- RetryStrategy enum: same_model, same_provider, different_provider
- StatusCodeEntry: flexible status code matching (single, range, list)

Also add retry_policy field to GatewayConfig with Default impl.

Signed-off-by: Troy Mitchell <i@troy-y.org>
Add comprehensive tests for retry policy configuration:

- proptest: round-trip serialization, default invariants, status code
  expansion (single, range, full range)
- YAML pattern tests covering 17 real-world configuration patterns:
  multi-provider failover, same-provider model downgrade, backoff on
  multiple error types, per-status-code strategy customization,
  timeout-specific config, no-retry, backoff scopes (model/provider/
  global), high-latency blocking, retry-after handling, fallback
  models list, mixed integer and range codes

Signed-off-by: Troy Mitchell <i@troy-y.org>
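
The status code expansion mentioned in the two commits above (single codes, ranges, and mixed lists) can be illustrated with a short Python sketch. The `"lo-hi"` string syntax for ranges and the function name are assumptions; the actual `StatusCodeEntry` type is defined in Rust and may use a different representation.

```python
def expand_status_codes(entries):
    """Expand a mixed list of ints and "lo-hi" range strings into sorted codes.

    Hypothetical helper: the range syntax here is assumed, not taken from the PR.
    """
    codes = set()
    for entry in entries:
        if isinstance(entry, int):
            codes.add(entry)
        else:
            lo, hi = (int(part) for part in entry.split("-", 1))
            if lo > hi:
                raise ValueError(f"invalid range: {entry}")
            codes.update(range(lo, hi + 1))  # ranges are inclusive
    return sorted(codes)
```

Collecting into a set first means overlapping entries, such as `503` together with `"500-504"`, do not produce duplicates.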
Add JSON schema definitions for retry policy configuration including
RetryPolicy, BackoffConfig, RetryAfterConfig, HighLatencyConfig,
LatencyTriggerConfig, RetryStrategy, StatusCodeEntry, and all
associated enums.

Signed-off-by: Troy Mitchell <i@troy-y.org>
Add the retry module with core type definitions including:
- RequestContext, RequestSignature for request deduplication
- RetryExhaustedError, AllProvidersExhaustedError for error handling
- AttemptError, AttemptErrorType for attempt tracking
- ValidationError, ValidationWarning for config validation
- Helper functions for provider extraction and hashing

Wire up pub mod retry in lib.rs.

Signed-off-by: Troy Mitchell <i@troy-y.org>
Implement BackoffCalculator supporting:
- Exponential backoff with configurable base/max delay
- Full, equal, and decorrelated jitter strategies
- Per-provider and per-status-code backoff overrides
- Comprehensive unit tests for all strategies

Signed-off-by: Troy Mitchell <i@troy-y.org>
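
As a rough illustration of the jitter strategies named in the commit above, the following Python sketch uses the commonly published formulas for full, equal, and decorrelated jitter. It is not the Rust `BackoffCalculator` from this PR; the function name, the `"none"` option, the `prev_ms` parameter, and the zero-based `attempt` index are assumptions.

```python
import random


def backoff_delay_ms(attempt, base_ms, max_ms, jitter="full", prev_ms=None, rng=random):
    """Return a backoff delay in milliseconds for a zero-based attempt index.

    Hypothetical sketch of exponential backoff capped at max_ms.
    """
    ceiling = min(max_ms, base_ms * (2 ** attempt))
    if jitter == "none":
        return ceiling
    if jitter == "full":
        # Anywhere between 0 and the exponential ceiling.
        return rng.uniform(0, ceiling)
    if jitter == "equal":
        # Half fixed, half random: never below ceiling / 2.
        return ceiling / 2 + rng.uniform(0, ceiling / 2)
    if jitter == "decorrelated":
        # Based on the previous delay rather than the attempt index.
        prev = base_ms if prev_ms is None else prev_ms
        return min(max_ms, rng.uniform(base_ms, prev * 3))
    raise ValueError(f"unknown jitter strategy: {jitter}")
```

Passing `rng=random.Random(seed)` makes the delays reproducible, which is convenient when a test asserts on backoff timing bounds.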
Implement ErrorDetector that classifies HTTP responses into:
- Retryable errors (5xx, 429, timeouts)
- Non-retryable errors (4xx client errors)
- Successful responses
Supports configurable status code matching and latency-based
error detection with measurement strategies (TTFB/total).

Signed-off-by: Troy Mitchell <i@troy-y.org>
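
The three-way classification described in the ErrorDetector commit above can be sketched in Python as follows. The enum and function names are hypothetical, the configurable status code matching and latency-based detection are left out, and treating every status outside 2xx, 429, and 5xx as non-retryable is an assumption.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    RETRYABLE = "retryable"
    NON_RETRYABLE = "non_retryable"


def classify(status, timed_out=False):
    """Classify an upstream result; `status` may be None when the call timed out."""
    if timed_out:
        return Outcome.RETRYABLE
    if status == 429 or 500 <= status <= 599:
        return Outcome.RETRYABLE
    if 200 <= status <= 299:
        return Outcome.SUCCESS
    # Assumption: everything else (for example 400 or 404) is not retried.
    return Outcome.NON_RETRYABLE
```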
Implement RetryErrorResponseBuilder that constructs structured
JSON error responses when all retry attempts are exhausted,
including per-attempt error details and provider information.

Signed-off-by: Troy Mitchell <i@troy-y.org>
Add three state management components:
- LatencyBlockStateManager: tracks providers blocked due to
  high latency with configurable block duration and scope
- LatencyTriggerCounter: counts consecutive latency threshold
  breaches before triggering provider blocking
- RetryAfterStateManager: honors Retry-After headers with
  per-provider/model/endpoint blocking scope

Signed-off-by: Troy Mitchell <i@troy-y.org>
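
To make the debouncing idea behind `min_triggers` and `trigger_window` concrete, here is a small Python sketch of a sliding-window breach counter. It is only an approximation of what the commit above describes: the class name, the use of seconds, and the explicit `now` argument are assumptions, and the real LatencyTriggerCounter may count breaches differently.

```python
from collections import deque


class SketchTriggerCounter:
    """Hypothetical sliding-window counter for latency threshold breaches."""

    def __init__(self, min_triggers, trigger_window_s):
        self.min_triggers = min_triggers
        self.trigger_window_s = trigger_window_s
        self._events = {}  # provider -> deque of breach timestamps

    def record_breach(self, provider, now):
        """Record a breach at time `now`; return True if blocking should trigger."""
        events = self._events.setdefault(provider, deque())
        events.append(now)
        # Drop breaches that have fallen out of the window.
        while events and now - events[0] > self.trigger_window_s:
            events.popleft()
        return len(events) >= self.min_triggers
```

Taking `now` as a parameter instead of reading the clock keeps the sketch deterministic under test.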
Implement validate_retry_policy() that checks retry policy
configuration for errors and warnings including:
- Invalid max_retries/timeout ranges
- Conflicting backoff and jitter settings
- Missing or invalid provider references
- Latency threshold consistency checks

Signed-off-by: Troy Mitchell <i@troy-y.org>
Implement ProviderSelector that determines the next provider
for retry attempts based on:
- Failover provider list with priority ordering
- Latency-blocked provider filtering
- Retry-After header honoring
- Round-robin and priority-based selection strategies

Signed-off-by: Troy Mitchell <i@troy-y.org>
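
A priority-ordered selection that skips blocked providers, as outlined in the ProviderSelector commit above, might look like the following Python sketch. The function signature and the single `blocked_until` map (standing in for both latency blocks and Retry-After blocks) are assumptions, and round-robin selection is not shown.

```python
def select_next_provider(candidates, tried, blocked_until, now):
    """Return the first candidate that is neither tried nor blocked, else None.

    candidates: provider names in priority order
    tried: set of providers already attempted for this request
    blocked_until: provider -> timestamp until which it must be skipped
    """
    for provider in candidates:
        if provider in tried:
            continue
        if blocked_until.get(provider, 0) > now:
            continue  # still blocked (for example by a Retry-After header)
        return provider
    return None
```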
Implement RetryOrchestrator as the top-level coordinator that:
- Manages the full retry lifecycle per request
- Integrates backoff, error detection, provider selection
- Handles request deduplication via content hashing
- Supports both same-provider retry and cross-provider failover
- Emits structured attempt records for observability

Signed-off-by: Troy Mitchell <i@troy-y.org>
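
The overall control flow of the orchestrator commit above can be approximated by a short Python loop for the cross-provider failover case. This is a sketch under stated assumptions, not the Rust RetryOrchestrator: `max_attempts` is taken to count every try including the first, the retryable set is hard-coded, and backoff, deduplication, and same-provider retry are left out.

```python
def send_with_failover(providers, forward_fn, max_attempts, retryable=(429, 503)):
    """Try providers in order until one returns a non-retryable status.

    Returns (final_status, attempts, exhausted). `attempts` records every try,
    similar in spirit to the structured attempt records mentioned above.
    """
    attempts = []
    status = None
    for provider in providers[:max_attempts]:
        status = forward_fn(provider)  # one upstream call per attempt
        attempts.append({"provider": provider, "status": status})
        if status not in retryable:
            break
    exhausted = status is None or status in retryable
    return status, attempts, exhausted
```

Returning the attempt list alongside the final status gives the caller what it needs to build an error response once every provider has been exhausted.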
Signed-off-by: Troy Mitchell <i@troy-y.org>
Avoid re-serializing request bodies when unnecessary to maintain
JSON key order, whitespace, and unknown fields — critical for
prompt cache prefix matching on providers like Anthropic.

- routing_service: only re-serialize when routing_preferences
  were actually removed from the body
- stream_context: replace model name at byte level instead of
  full deserialization/re-serialization cycle
- Strip provider prefix from model name (e.g. 'custom-aws/claude-opus-4-6'
  -> 'claude-opus-4-6') before sending upstream

Signed-off-by: Troy Mitchell <i@troy-y.org>
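
The byte-level model replacement described in the commit above can be sketched in Python with a regular expression over the raw body. This is an illustration only: the function names are hypothetical, and the sketch replaces the first `"model"` key it finds, so it assumes no nested object carries a `"model"` key before the top-level one.

```python
import json
import re

# Matches "model": "<json string>" and captures the key-and-colon prefix.
_MODEL_FIELD = re.compile(rb'("model"\s*:\s*)"(?:[^"\\]|\\.)*"')


def strip_provider_prefix(model):
    """'custom-aws/claude-opus-4-6' -> 'claude-opus-4-6'."""
    return model.split("/", 1)[1] if "/" in model else model


def replace_model_bytes(body, new_model):
    """Swap the model value in place, leaving all other bytes untouched."""
    replacement = json.dumps(new_model).encode()
    # A function replacement avoids backslash processing in the new value.
    return _MODEL_FIELD.sub(lambda m: m.group(1) + replacement, body, count=1)
```

Because only the matched value is rewritten, key order, whitespace, and unknown fields in the rest of the body stay byte-for-byte identical, which is the property the commit relies on for prompt cache prefix matching.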
Wire up the retry module into the brightstaff LLM handler:
- Add send_upstream_with_retry() that uses RetryOrchestrator
  to coordinate retry attempts with backoff and failover
- Build forward_fn closure for per-attempt HTTP calls
- Support failover to alternative providers on retryable errors
- Fall back to single-attempt send_upstream() when no retry
  policy is configured

Signed-off-by: Troy Mitchell <i@troy-y.org>
Remove model_id-level deduplication that prevented different providers
from serving the same model (e.g., custom/claude-opus-4-6 and
custom-aws/claude-opus-4-6). Full model_name dedup is already handled
upstream. Also fix model field to use full provider/model format
instead of stripped model_id, and adjust router model lookup
accordingly.

Signed-off-by: Troy Mitchell <i@troy-y.org>
Clone request_path into the retry closure so that failover requests
to alternate providers preserve the original request path.

Signed-off-by: Troy Mitchell <i@troy-y.org>
Provide a minimal example configuration demonstrating retry and
multi-provider failover setup with OpenAI and Anthropic providers.

Signed-off-by: Troy Mitchell <i@troy-y.org>
Add 13 test-specific plano configs covering:
- Basic 429/503 retry scenarios
- Multi-provider failover with priority ordering
- Max attempts and backoff delay verification
- Retry-After header honoring and blocking
- Timeout-triggered retry
- High latency failover
- Streaming retry
- Request body preservation across retries

Signed-off-by: Troy Mitchell <i@troy-y.org>
Comprehensive pytest-based e2e test suite with mock HTTP servers
simulating provider failures (429, 503, timeouts, high latency).
Tests verify retry orchestration, provider selection, backoff timing,
Retry-After header handling, streaming retry, and request body
preservation across retry attempts.

Signed-off-by: Troy Mitchell <i@troy-y.org>
- test_failover_exploration.py: verify provider selection logic,
  fallback ordering, and error-triggered failover behavior
- test_failover_preservation.py: verify request context (headers,
  body, path) is preserved across failover attempts

Signed-off-by: Troy Mitchell <i@troy-y.org>