test: add retry/failover e2e and unit tests #930
Open
TroyMitchell911 wants to merge 22 commits into
Signed-off-by: Troy Mitchell <i@troy-y.org>
Add retry policy configuration types to support automatic retry and failover for LLM requests:
- RetryPolicy: top-level config with fallback_models, default_strategy, default_max_attempts, and per-status-code overrides
- BackoffConfig: exponential backoff with base_ms, max_ms, jitter, and scope (per-model, per-provider, or global)
- RetryAfterConfig: Retry-After header handling with block scope and duration limits
- HighLatencyConfig: latency-based blocking with threshold, measurement type, and trigger conditions
- LatencyTriggerConfig: min_triggers and trigger_window for debouncing
- RetryStrategy enum: same_model, same_provider, different_provider
- StatusCodeEntry: flexible status code matching (single, range, list)
Also add retry_policy field to GatewayConfig with Default impl.
Signed-off-by: Troy Mitchell <i@troy-y.org>
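The flexible status-code matching listed above (single code, range, list) can be illustrated with a small expansion routine. This is a hedged Python sketch rather than the Rust type from the commit: the entry shapes (an integer, a "start-end" string, or a list mixing both) and the name `expand_status_codes` are assumptions made for illustration.

```python
from typing import List, Set, Union

# Hypothetical entry shapes: 429, "500-504", or [429, "502-503"].
# The real StatusCodeEntry serialization may differ.
StatusCodeEntry = Union[int, str, List[Union[int, str]]]


def expand_status_codes(entry: StatusCodeEntry) -> Set[int]:
    """Expand one entry into the concrete set of HTTP status codes it matches."""
    if isinstance(entry, bool):
        raise TypeError("booleans are not status codes")
    if isinstance(entry, int):
        codes = {entry}
    elif isinstance(entry, str):
        # "500-504" is an inclusive range; "503" is a single code.
        start_text, sep, end_text = entry.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start
        if start > end:
            raise ValueError(f"empty range: {entry!r}")
        codes = set(range(start, end + 1))
    elif isinstance(entry, list):
        codes = set()
        for item in entry:
            codes |= expand_status_codes(item)
    else:
        raise TypeError(f"unsupported entry: {entry!r}")
    for code in codes:
        if not 100 <= code <= 599:
            raise ValueError(f"not an HTTP status code: {code}")
    return codes
```

Expanding to a set up front keeps the per-response check a constant-time membership test, at the cost of a little memory per configured entry.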
Add comprehensive tests for retry policy configuration:
- proptest: round-trip serialization, default invariants, status code expansion (single, range, full range)
- YAML pattern tests covering 17 real-world configuration patterns: multi-provider failover, same-provider model downgrade, backoff on multiple error types, per-status-code strategy customization, timeout-specific config, no-retry, backoff scopes (model/provider/global), high-latency blocking, retry-after handling, fallback models list, mixed integer and range codes
Signed-off-by: Troy Mitchell <i@troy-y.org>
Add JSON schema definitions for retry policy configuration including RetryPolicy, BackoffConfig, RetryAfterConfig, HighLatencyConfig, LatencyTriggerConfig, RetryStrategy, StatusCodeEntry, and all associated enums.
Signed-off-by: Troy Mitchell <i@troy-y.org>
Signed-off-by: Troy Mitchell <i@troy-y.org>
Add the retry module with core type definitions including:
- RequestContext, RequestSignature for request deduplication
- RetryExhaustedError, AllProvidersExhaustedError for error handling
- AttemptError, AttemptErrorType for attempt tracking
- ValidationError, ValidationWarning for config validation
- Helper functions for provider extraction and hashing
Wire up pub mod retry in lib.rs.
Signed-off-by: Troy Mitchell <i@troy-y.org>
Implement BackoffCalculator supporting:
- Exponential backoff with configurable base/max delay
- Full, equal, and decorrelated jitter strategies
- Per-provider and per-status-code backoff overrides
- Comprehensive unit tests for all strategies
Signed-off-by: Troy Mitchell <i@troy-y.org>
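For readers unfamiliar with the three jitter strategies named above, the following Python sketch shows the commonly used formulas for capped exponential backoff. It is not the BackoffCalculator code; the function name, the zero-based `attempt` convention, and the `prev_ms` parameter are assumptions for illustration.

```python
import random


def backoff_delay_ms(attempt, base_ms, max_ms, jitter="full", rng=random, prev_ms=None):
    """Return a delay in milliseconds for a zero-based attempt index.

    jitter: "none", "full", "equal", or "decorrelated".
    prev_ms is only used by the decorrelated strategy (the previous delay).
    """
    # Exponential growth, capped at max_ms.
    capped = min(max_ms, base_ms * (2 ** attempt))
    if jitter == "none":
        return capped
    if jitter == "full":
        # Anywhere between zero and the capped exponential delay.
        return rng.uniform(0, capped)
    if jitter == "equal":
        # Half fixed, half random: never below capped / 2.
        return capped / 2 + rng.uniform(0, capped / 2)
    if jitter == "decorrelated":
        # Grows from the previous delay instead of the attempt index.
        previous = base_ms if prev_ms is None else prev_ms
        return min(max_ms, rng.uniform(base_ms, previous * 3))
    raise ValueError(f"unknown jitter strategy: {jitter!r}")
```

Passing the random source in as `rng` makes the delays reproducible in tests, for example with `random.Random(0)`.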
Implement ErrorDetector that classifies HTTP responses into:
- Retryable errors (5xx, 429, timeouts)
- Non-retryable errors (4xx client errors)
- Successful responses
Supports configurable status code matching and latency-based error detection with measurement strategies (TTFB/total).
Signed-off-by: Troy Mitchell <i@troy-y.org>
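The three-way classification above can be sketched as follows. This Python version only mirrors the categories named in the commit message; the names `Outcome`, `classify`, and `is_high_latency`, the default retryable set, and the choice to treat any status below 400 as success are assumptions, not the ErrorDetector API.

```python
from enum import Enum

# Assumed default: 429 plus every 5xx code is retryable.
DEFAULT_RETRYABLE = frozenset({429, *range(500, 600)})


class Outcome(Enum):
    SUCCESS = "success"
    RETRYABLE = "retryable"
    NON_RETRYABLE = "non_retryable"


def classify(status, timed_out=False, retryable_codes=DEFAULT_RETRYABLE):
    """Classify one upstream attempt; status is None when no response arrived."""
    if timed_out or status is None:
        return Outcome.RETRYABLE
    if status in retryable_codes:
        return Outcome.RETRYABLE
    if status >= 400:
        return Outcome.NON_RETRYABLE
    return Outcome.SUCCESS


def is_high_latency(ttfb_ms, total_ms, measure, threshold_ms):
    """Compare the configured measurement (time to first byte or total) to a threshold."""
    if measure not in ("ttfb", "total"):
        raise ValueError(f"unknown measurement: {measure!r}")
    observed = ttfb_ms if measure == "ttfb" else total_ms
    return observed > threshold_ms
```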
Implement RetryErrorResponseBuilder that constructs structured JSON error responses when all retry attempts are exhausted, including per-attempt error details and provider information.
Signed-off-by: Troy Mitchell <i@troy-y.org>
Add three state management components:
- LatencyBlockStateManager: tracks providers blocked due to high latency with configurable block duration and scope
- LatencyTriggerCounter: counts consecutive latency threshold breaches before triggering provider blocking
- RetryAfterStateManager: honors Retry-After headers with per-provider/model/endpoint blocking scope
Signed-off-by: Troy Mitchell <i@troy-y.org>
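One way to picture the state these components keep is a breach counter that trips after `min_triggers` breaches inside `trigger_window`, plus a table of block expiry times. The Python sketch below rests on several assumptions: the class and method names are hypothetical, time is passed in explicitly as seconds, and a successful fast response resets the counter (one reading of "consecutive").

```python
from collections import defaultdict, deque


class LatencyTriggerCounterSketch:
    """Trips once min_triggers breaches fall within trigger_window_s seconds."""

    def __init__(self, min_triggers, trigger_window_s):
        self.min_triggers = min_triggers
        self.trigger_window_s = trigger_window_s
        self._breaches = defaultdict(deque)  # key -> breach timestamps

    def record_breach(self, key, now):
        timestamps = self._breaches[key]
        timestamps.append(now)
        # Drop breaches that have aged out of the window.
        while timestamps and now - timestamps[0] > self.trigger_window_s:
            timestamps.popleft()
        return len(timestamps) >= self.min_triggers

    def record_ok(self, key):
        # Assumption: a fast response breaks the run of breaches.
        self._breaches.pop(key, None)


class BlockStateSketch:
    """Remembers until when a provider/model key is blocked."""

    def __init__(self, max_block_s):
        self.max_block_s = max_block_s
        self._blocked_until = {}

    def block(self, key, duration_s, now):
        # Clamp the requested duration, e.g. an oversized Retry-After value.
        until = now + min(duration_s, self.max_block_s)
        self._blocked_until[key] = max(until, self._blocked_until.get(key, 0.0))

    def is_blocked(self, key, now):
        until = self._blocked_until.get(key)
        if until is None:
            return False
        if now >= until:
            del self._blocked_until[key]  # expired: forget the block
            return False
        return True
```

Taking `now` as an argument instead of reading the clock inside the class keeps the expiry logic deterministic under test.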
Implement validate_retry_policy() that checks retry policy configuration for errors and warnings including:
- Invalid max_retries/timeout ranges
- Conflicting backoff and jitter settings
- Missing or invalid provider references
- Latency threshold consistency checks
Signed-off-by: Troy Mitchell <i@troy-y.org>
Implement ProviderSelector that determines the next provider for retry attempts based on:
- Failover provider list with priority ordering
- Latency-blocked provider filtering
- Retry-After header honoring
- Round-robin and priority-based selection strategies
Signed-off-by: Troy Mitchell <i@troy-y.org>
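Priority-ordered selection with blocked-provider filtering might look like the sketch below. It covers only the priority-based path, not round-robin, and it assumes models are named `provider/model`, that blocking is tracked per provider, and that the three strategy names behave as their names suggest. The function names and the placeholder model names in the usage are hypothetical.

```python
def provider_of(model):
    """Return the provider prefix of a 'provider/model' name."""
    return model.split("/", 1)[0]


def select_next(current, fallback_models, strategy, tried, blocked_providers):
    """Pick the next model to try, or None when nothing is eligible.

    fallback_models is in priority order; tried holds models already
    attempted; blocked_providers holds providers currently blocked
    (for example by Retry-After or high latency).
    """
    if strategy == "same_model":
        return None if provider_of(current) in blocked_providers else current
    for candidate in fallback_models:
        if candidate in tried:
            continue
        if provider_of(candidate) in blocked_providers:
            continue
        same_provider = provider_of(candidate) == provider_of(current)
        if strategy == "same_provider" and not same_provider:
            continue
        if strategy == "different_provider" and same_provider:
            continue
        return candidate
    return None
```

For example, with fallbacks `["a/m2", "b/m1", "c/m1"]`, current model `a/m1`, and provider `b` blocked, the `different_provider` strategy skips `a/m2` and `b/m1` and returns `c/m1`.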
Implement RetryOrchestrator as the top-level coordinator that:
- Manages the full retry lifecycle per request
- Integrates backoff, error detection, provider selection
- Handles request deduplication via content hashing
- Supports both same-provider retry and cross-provider failover
- Emits structured attempt records for observability
Signed-off-by: Troy Mitchell <i@troy-y.org>
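The lifecycle described above reduces to a loop: try a candidate, classify the result, record the attempt, wait, and move on. The Python sketch below shows that loop under simplifying assumptions: one attempt per candidate, 429 and 5xx as the only retryable results, and hypothetical names (`run_with_retry`, `RetryExhausted`, `forward_fn` returning a `(status, body)` pair). Deduplication and provider blocking are left out.

```python
import time


class RetryExhausted(Exception):
    """Raised when every permitted attempt failed; carries the attempt records."""

    def __init__(self, attempts):
        super().__init__(f"all {len(attempts)} attempts failed")
        self.attempts = attempts


def run_with_retry(candidates, forward_fn, max_attempts,
                   delay_s=lambda attempt: 0.0, sleep=time.sleep):
    """Try candidates in order until one succeeds or attempts run out.

    forward_fn(model) -> (status, body). Returns (status, body, attempts),
    where attempts lists the failed tries that preceded the result.
    """
    attempts = []
    for attempt, model in enumerate(candidates[:max_attempts]):
        if attempt > 0:
            sleep(delay_s(attempt - 1))  # back off before every retry
        status, body = forward_fn(model)
        if status != 429 and status < 500:
            # Success or a non-retryable client error: return as-is.
            return status, body, attempts
        attempts.append({"model": model, "status": status})
    raise RetryExhausted(attempts)
```

Injecting `sleep` lets a test run the loop without real delays while still observing which backoff values were requested.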
Signed-off-by: Troy Mitchell <i@troy-y.org>
Avoid re-serializing request bodies when unnecessary, so that JSON key order, whitespace, and unknown fields are preserved. This is critical for prompt cache prefix matching on providers like Anthropic.
- routing_service: only re-serialize when routing_preferences were actually removed from the body
- stream_context: replace model name at byte level instead of full deserialization/re-serialization cycle
- Strip provider prefix from model name (e.g. 'custom-aws/claude-opus-4-6' -> 'claude-opus-4-6') before sending upstream
Signed-off-by: Troy Mitchell <i@troy-y.org>
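To show why a byte-level replacement preserves the rest of the body, here is a hedged Python sketch of the idea. It is not the stream_context code: the function names are hypothetical, and the regex is a simplification that rewrites the first `"model": "..."` pair it finds, so it would be wrong for bodies where a nested object or a string value contains that pattern earlier.

```python
import json
import re

# Matches "model" : "<value>" and captures the value separately.
_MODEL_FIELD = re.compile(rb'("model"\s*:\s*")((?:[^"\\]|\\.)*)(")')


def strip_provider_prefix(model):
    """'custom-aws/claude-opus-4-6' -> 'claude-opus-4-6'; names without '/' pass through."""
    _, sep, rest = model.partition("/")
    return rest if sep else model


def replace_model_bytes(body, new_model):
    """Swap only the model value; every other byte of the body is untouched."""
    # json.dumps gives a correctly escaped JSON string; drop its quotes.
    encoded = json.dumps(new_model)[1:-1].encode("utf-8")
    return _MODEL_FIELD.sub(
        lambda m: m.group(1) + encoded + m.group(3), body, count=1
    )
```

A full `json.loads` / `json.dumps` round trip would produce equivalent JSON, but it could change spacing or number formatting, which is enough to break byte-prefix matching.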
Wire up the retry module into the brightstaff LLM handler:
- Add send_upstream_with_retry() that uses RetryOrchestrator to coordinate retry attempts with backoff and failover
- Build forward_fn closure for per-attempt HTTP calls
- Support failover to alternative providers on retryable errors
- Fall back to single-attempt send_upstream() when no retry policy is configured
Signed-off-by: Troy Mitchell <i@troy-y.org>
Remove model_id-level deduplication that prevented different providers from serving the same model (e.g., custom/claude-opus-4-6 and custom-aws/claude-opus-4-6). Full model_name dedup is already handled upstream. Also fix model field to use full provider/model format instead of stripped model_id, and adjust router model lookup accordingly.
Signed-off-by: Troy Mitchell <i@troy-y.org>
Clone request_path into the retry closure so that failover requests to alternate providers preserve the original request path.
Signed-off-by: Troy Mitchell <i@troy-y.org>
Provide a minimal example configuration demonstrating retry and multi-provider failover setup with OpenAI and Anthropic providers.
Signed-off-by: Troy Mitchell <i@troy-y.org>
Add 13 test-specific plano configs covering:
- Basic 429/503 retry scenarios
- Multi-provider failover with priority ordering
- Max attempts and backoff delay verification
- Retry-After header honoring and blocking
- Timeout-triggered retry
- High latency failover
- Streaming retry
- Request body preservation across retries
Signed-off-by: Troy Mitchell <i@troy-y.org>
Comprehensive pytest-based e2e test suite with mock HTTP servers simulating provider failures (429, 503, timeouts, high latency). Tests verify retry orchestration, provider selection, backoff timing, Retry-After header handling, streaming retry, and request body preservation across retry attempts.
Signed-off-by: Troy Mitchell <i@troy-y.org>
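A mock upstream of the kind described above can be built from the Python standard library alone. The sketch below is not taken from the PR's test suite: `make_mock_provider`, `post`, the scripted-response format, and the `/v1/chat/completions` path are all assumptions chosen for illustration. Each request consumes the next scripted `(status, headers)` pair, and once the script is empty the server answers 200.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_mock_provider(script):
    """Start a local server; returns (server, received_bodies)."""
    remaining = list(script)
    received = []
    lock = threading.Lock()

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)
            with lock:
                received.append(body)  # kept for body-preservation checks
                status, headers = remaining.pop(0) if remaining else (200, {})
            payload = json.dumps({"status": status}).encode("utf-8")
            self.send_response(status)
            for name, value in headers.items():
                self.send_header(name, value)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, *args):
            pass  # keep test output quiet

    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, received


def post(port, body):
    """Send one POST and return (status, Retry-After header or None)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("POST", "/v1/chat/completions", body=body,
                     headers={"Content-Type": "application/json"})
        response = conn.getresponse()
        response.read()
        return response.status, response.getheader("Retry-After")
    finally:
        conn.close()
```

Recording the raw request bodies lets a test compare what each retry attempt sent byte for byte, which is the property the body-preservation scenario checks.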
- test_failover_exploration.py: verify provider selection logic, fallback ordering, and error-triggered failover behavior
- test_failover_preservation.py: verify request context (headers, body, path) is preserved across failover attempts
Signed-off-by: Troy Mitchell <i@troy-y.org>
This was referenced Apr 28, 2026
Summary
Continuation of #733. Adds comprehensive tests for the retry/failover feature implemented in #926–#929.
PR Chain
This is PR 5 of 5 (final) in the retry/failover feature series:
What This PR Adds
E2E Test Configs (13 scenarios)
Each test config uses mock upstream servers to simulate specific failure modes:
- retry_it1_basic_429.yaml: different_provider strategy
- retry_it2_503_different_provider.yaml
- retry_it3_all_exhausted.yaml
- retry_it4_no_retry_policy.yaml: no retry_policy → no retry (passthrough)
- retry_it5_max_attempts.yaml: max_attempts limit respected
- retry_it6_backoff_delay.yaml
- retry_it7_fallback_priority.yaml: fallback_models priority order
- retry_it8_retry_after_honored.yaml: Retry-After header honored
- retry_it9_retry_after_blocks_selection.yaml: Retry-After blocks provider selection
- retry_it10_timeout_triggers_retry.yaml
- retry_it11_high_latency_failover.yaml
- retry_it12_streaming.yaml
- retry_it13_body_preserved.yaml
Example Test Config Format
Unit Tests
Changes
- tests/e2e/configs/retry_it*.yaml: 13 E2E test configurations
- tests/e2e/retry_failover_test.rs: Integration test suite
- crates/common/src/retry/tests.rs: Unit tests for failover logic
Signed-off-by: Troy Mitchell <troymitchell988@gmail.com>