feat: multi-key rotation & intelligent rate-limit recovery#65
Open
feat: multi-key rotation & intelligent rate-limit recovery#65
Conversation
Covers 9 scenarios: initialization with run_id, single-token migration, key rotation on rate limit, exhausted pool returning None, auto-wait with mocked sleep, auto-wait disabled, max-wait exceeded, pool status counts, and audit event logging for key_rate_limited / key_rotated / key_pool_waiting / key_pool_resumed events. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Covers POST /keys (add, encrypt, mask, validation), GET /keys (list, mask,
status fields), PATCH /keys/{id} (partial update, not-found), DELETE
/keys/{id} (removes key, 409 for last enabled, 404 for missing),
GET /keys/status (pool structure, counts, masked keys), GET /keys/config
(defaults), and documents the PATCH /keys/config routing bug where FastAPI
resolves "config" as a key_id before the dedicated config handler is reached.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…it recovery Covers database CRUD (add/list/update/delete with encryption), key selection (priority, LRU, disabled/rate-limited skipping, provider isolation), rotation logic (ping-pong, codex fallback, audit logging), config defaults/updates, one-time migration from settings.claude_token, and auto-wait behaviour with mocked asyncio.sleep. Tests use real temporary SQLite databases via agent.db.init_db and a temporary Fernet master key; key_pool.MASTER_KEY_PATH and monitor.crypto._fernet are patched per-test to ensure isolation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a key pool system that rotates across multiple Claude Code OAuth keys on rate limit, with optional Codex fallback and auto-wait for cooldown recovery. This eliminates manual intervention when rate limits are hit. New files: - key_pool.py: KeyPool class with priority-based selection, rotation, auto-wait, config management, and single-token migration - codex_client.py: Thin OpenAI Codex API adapter (off by default) - test_key_pool.py: 58 unit tests for key pool logic - test_key_pool_integration.py: 9 integration tests - test_codex_client.py: 16 unit tests for Codex adapter - test_key_endpoints.py: 34 endpoint tests Modified files: - db.py: Add api_keys + key_rotation_config tables - runner.py: Replace rate limit handler with key rotation, add SDK client restart loop for seamless key swaps - endpoints.py: Add /keys/* CRUD and /keys/config management routes - monitor/app.py: Add proxy routes + matching schema tables - permissions.py: Block agent reads of key pool data 602/603 tests pass (1 pre-existing failure unrelated to this change). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The runner was calling mark_rate_limited() explicitly then calling handle_rate_limit() which also calls it internally. This caused the resets_at/utilization to be overwritten with None on the second call. Also removed duplicate audit log entries — key_pool.py is now the single source of truth for key rotation and fallback audit events. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…eding - Re-fetch key from DB after auto-wait sleep to avoid stale state - Add signals.has_pending_signals() public method instead of accessing _signal_queue internals from runner.py - Move config seeding flag to instance level to avoid test pollution - Log decrypt failures in list_keys instead of silently masking Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…l_requests, wire Codex mode - Add asyncio.Lock to KeyPool for task-safe concurrent access - Add __repr__ to CodexClient that masks API key (prevents leaks in tracebacks) - Rename misleading total_requests to rate_limit_hits (only incremented on limit) - Wire Codex degraded mode into runner: polls for Claude key recovery every 60s instead of falling through as a no-op - Add 3 new tests for CodexClient repr masking Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…gration tests
- Fix monitor proxy route ordering: /api/keys/status and /api/keys/config
must be registered before /api/keys/{key_id} to avoid matching "status"
and "config" as key_id parameters (same fix applied earlier to agent endpoints)
- Add DB migration for total_requests -> rate_limit_hits column rename
(handles databases created during intermediate testing)
- Add 3 integration tests: codex fallback enabled, codex fallback disabled,
concurrent handle_rate_limit no-deadlock
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…handling 4 tests that exercise _run_loop directly with mocked SDK client: - Key rotation on rate limit (switches to second key, restarts client) - Pause when all keys exhausted (auto_wait disabled) - Codex fallback polls for Claude key recovery - Legacy single-key path (no key pool) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ock docs - Remove unnecessary CodexClient instantiation in runner (kept decrypted key in memory for entire polling loop with no usage) - Check signals before sleeping in Codex degraded mode (was sleeping 60s before first signal check, now checks immediately) - Sleep in 10s chunks instead of 60s for responsive signal handling - Log audit event when Codex fallback ends due to signal - Add concurrency safety notes to mark_rate_limited/clear_rate_limit docs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…king Addresses security audit findings: - M1: Add allow-list validation in update_config() — rejects any config key not in DEFAULT_CONFIG, preventing arbitrary key injection - M4: Block Bash access to /data/ directory (master key, databases) in permissions.py — covers cat, head, tail, less, cp, xxd, base64 - Add test for unknown config key rejection Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Addresses security audit M1: the monitor proxy routes used `body: dict` which bypassed Pydantic validation, allowing arbitrary JSON to flow through to the agent. Now uses typed models: AddKeyProxyRequest, UpdateKeyProxyRequest, RotationConfigProxyUpdate — matching the agent endpoint models and providing input validation at the proxy layer. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verifies that arbitrary key-value pairs in update_config are rejected at the KeyPool level, preventing injection into key_rotation_config table. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Logs a key_pool_exhausted event with earliest_reset, wait_minutes, and active_key_id when all keys are rate-limited and auto-wait is exhausted or disabled. Completes the full set of 8 audit event types specified in the PRD. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verifies that the run_id is preserved across key rotations and that audit events are logged against the correct run_id. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tological assertion - runner.py: Track actual sleep duration instead of hardcoded 60s in codex polling loop, preventing waited counter from exceeding real elapsed time - key_pool.py: Move audit logging and status updates outside asyncio.Lock scope in wait_for_next_available_key to avoid holding lock during slow DB writes - test_key_pool_integration.py: Replace tautological assertion in concurrent deadlock test with meaningful checks on result count Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RunManager.start_run() now uses KeyPool.get_next_key() to assign each parallel worker a distinct API key from the pool. The LRU tiebreaker in get_next_key() naturally distributes keys across workers spawned in sequence. Falls back to the raw credential if the pool is empty. Since all workers share the same SQLite DB, rate limit state written by one worker's KeyPool is immediately visible to other workers — Worker B will skip a key that Worker A just marked as rate-limited. Adds 3 integration tests: - test_parallel_workers_get_different_keys - test_worker_rate_limit_visible_to_other_workers - test_single_key_shared_across_workers Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implements PRD Section 6 security requirement: rate limiting on key management endpoints to prevent brute-force key extraction. - New agent/rate_limit.py: sliding window RateLimiter class (~50 lines) with injectable time function for testability - All 7 /keys/* routes in endpoints.py protected via FastAPI Depends - All 7 /api/keys/* proxy routes in monitor/app.py also protected with a separate limiter instance (prevents bypass via monitor) - Returns HTTP 429 with Retry-After header when limit exceeded - 3 new tests: under threshold, at threshold (429), window reset Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Health endpoint reads from signals.current_run_id, not main._current_run_id. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…, worker detection, env cleanup - key_pool.py: wrap delete_key in asyncio.Lock to prevent TOCTOU race deleting last enabled key - run_manager.py: share single KeyPool instance across worker launches for proper LRU distribution - run_manager.py: restore WORKER_MODE=1 env var for worker container detection - endpoints.py: use module-level _key_pool singleton so asyncio.Lock actually protects state - runner.py: clean up CLAUDE_CODE_OAUTH_TOKEN from os.environ in finally blocks - permissions.py: block bare env/printenv commands that dump all environment variables Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Multi-key pool: rotate across multiple Claude Code OAuth keys automatically on rate limit, zero downtime when a second key is available.
Key changes
Review fixes (c039061)
Pre-Landing Review
5 critical issues found and fixed. 7 informational findings logged.
Test plan