Skip to content

feat: multi-key rotation & intelligent rate-limit recovery#65

Open
heemzers wants to merge 21 commits intomainfrom
feat/multi-key-rotation
Open

feat: multi-key rotation & intelligent rate-limit recovery#65
heemzers wants to merge 21 commits intomainfrom
feat/multi-key-rotation

Conversation

@heemzers
Copy link
Copy Markdown
Contributor

@heemzers heemzers commented Apr 3, 2026

Summary

Multi-key pool: rotate across multiple Claude Code OAuth keys automatically on rate limit, zero downtime when a second key is available.

Key changes

  • Key pool and rotation (agent/key_pool.py): encrypted key storage, priority-based selection, LRU distribution, auto-wait for rate limit recovery
  • Auto-wait and resume: when all keys exhausted, agent sleeps until earliest key resets
  • Optional Codex fallback: off by default
  • Full observability: 8 new audit event types
  • Security: keys encrypted at rest (Fernet), masked in API responses, env cleanup
  • Rate limiting: 10 req/min on /keys/* endpoints
  • RunManager integration: parallel workers get distinct keys via shared KeyPool
  • Backward compatible: single-token auto-migration

Review fixes (c039061)

  • TOCTOU race in delete_key wrapped in asyncio.Lock
  • Shared KeyPool singleton in RunManager
  • Restored WORKER_MODE=1 env var
  • Module-level KeyPool singleton in endpoints
  • os.environ cleanup in finally blocks

Pre-Landing Review

5 critical issues found and fixed. 7 informational findings logged.

Test plan

  • 552/552 tests pass
  • Verification gate passed after review fix commit

Self-Improve Bot and others added 21 commits April 3, 2026 16:09
Covers 9 scenarios: initialization with run_id, single-token migration,
key rotation on rate limit, exhausted pool returning None, auto-wait with
mocked sleep, auto-wait disabled, max-wait exceeded, pool status counts,
and audit event logging for key_rate_limited / key_rotated / key_pool_waiting
/ key_pool_resumed events.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Covers POST /keys (add, encrypt, mask, validation), GET /keys (list, mask,
status fields), PATCH /keys/{id} (partial update, not-found), DELETE
/keys/{id} (removes key, 409 for last enabled, 404 for missing),
GET /keys/status (pool structure, counts, masked keys), GET /keys/config
(defaults), and documents the PATCH /keys/config routing bug where FastAPI
resolves "config" as a key_id before the dedicated config handler is reached.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…it recovery

Covers database CRUD (add/list/update/delete with encryption), key selection
(priority, LRU, disabled/rate-limited skipping, provider isolation), rotation
logic (ping-pong, codex fallback, audit logging), config defaults/updates,
one-time migration from settings.claude_token, and auto-wait behaviour with
mocked asyncio.sleep.

Tests use real temporary SQLite databases via agent.db.init_db and a
temporary Fernet master key; key_pool.MASTER_KEY_PATH and
monitor.crypto._fernet are patched per-test to ensure isolation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a key pool system that rotates across multiple Claude Code OAuth keys
on rate limit, with optional Codex fallback and auto-wait for cooldown
recovery. This eliminates manual intervention when rate limits are hit.

New files:
- key_pool.py: KeyPool class with priority-based selection, rotation,
  auto-wait, config management, and single-token migration
- codex_client.py: Thin OpenAI Codex API adapter (off by default)
- test_key_pool.py: 58 unit tests for key pool logic
- test_key_pool_integration.py: 9 integration tests
- test_codex_client.py: 16 unit tests for Codex adapter
- test_key_endpoints.py: 34 endpoint tests

Modified files:
- db.py: Add api_keys + key_rotation_config tables
- runner.py: Replace rate limit handler with key rotation, add SDK
  client restart loop for seamless key swaps
- endpoints.py: Add /keys/* CRUD and /keys/config management routes
- monitor/app.py: Add proxy routes + matching schema tables
- permissions.py: Block agent reads of key pool data

602/603 tests pass (1 pre-existing failure unrelated to this change).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The runner was calling mark_rate_limited() explicitly then calling
handle_rate_limit() which also calls it internally. This caused
the resets_at/utilization to be overwritten with None on the second call.
Also removed duplicate audit log entries — key_pool.py is now the single
source of truth for key rotation and fallback audit events.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…eding

- Re-fetch key from DB after auto-wait sleep to avoid stale state
- Add signals.has_pending_signals() public method instead of accessing
  _signal_queue internals from runner.py
- Move config seeding flag to instance level to avoid test pollution
- Log decrypt failures in list_keys instead of silently masking

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…l_requests, wire Codex mode

- Add asyncio.Lock to KeyPool for task-safe concurrent access
- Add __repr__ to CodexClient that masks API key (prevents leaks in tracebacks)
- Rename misleading total_requests to rate_limit_hits (only incremented on limit)
- Wire Codex degraded mode into runner: polls for Claude key recovery every 60s
  instead of falling through as a no-op
- Add 3 new tests for CodexClient repr masking

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…gration tests

- Fix monitor proxy route ordering: /api/keys/status and /api/keys/config
  must be registered before /api/keys/{key_id} to avoid matching "status"
  and "config" as key_id parameters (same fix applied earlier to agent endpoints)
- Add DB migration for total_requests -> rate_limit_hits column rename
  (handles databases created during intermediate testing)
- Add 3 integration tests: codex fallback enabled, codex fallback disabled,
  concurrent handle_rate_limit no-deadlock

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…handling

4 tests that exercise _run_loop directly with mocked SDK client:
- Key rotation on rate limit (switches to second key, restarts client)
- Pause when all keys exhausted (auto_wait disabled)
- Codex fallback polls for Claude key recovery
- Legacy single-key path (no key pool)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ock docs

- Remove unnecessary CodexClient instantiation in runner (kept decrypted
  key in memory for entire polling loop with no usage)
- Check signals before sleeping in Codex degraded mode (was sleeping 60s
  before first signal check, now checks immediately)
- Sleep in 10s chunks instead of 60s for responsive signal handling
- Log audit event when Codex fallback ends due to signal
- Add concurrency safety notes to mark_rate_limited/clear_rate_limit docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…king

Addresses security audit findings:
- M1: Add allow-list validation in update_config() — rejects any config
  key not in DEFAULT_CONFIG, preventing arbitrary key injection
- M4: Block Bash access to /data/ directory (master key, databases) in
  permissions.py — covers cat, head, tail, less, cp, xxd, base64
- Add test for unknown config key rejection

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Addresses security audit M1: the monitor proxy routes used `body: dict`
which bypassed Pydantic validation, allowing arbitrary JSON to flow
through to the agent. Now uses typed models: AddKeyProxyRequest,
UpdateKeyProxyRequest, RotationConfigProxyUpdate — matching the agent
endpoint models and providing input validation at the proxy layer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verifies that arbitrary key-value pairs in update_config are rejected
at the KeyPool level, preventing injection into key_rotation_config table.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Logs a key_pool_exhausted event with earliest_reset, wait_minutes,
and active_key_id when all keys are rate-limited and auto-wait is
exhausted or disabled. Completes the full set of 8 audit event types
specified in the PRD.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verifies that the run_id is preserved across key rotations and that
audit events are logged against the correct run_id.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tological assertion

- runner.py: Track actual sleep duration instead of hardcoded 60s in codex
  polling loop, preventing waited counter from exceeding real elapsed time
- key_pool.py: Move audit logging and status updates outside asyncio.Lock
  scope in wait_for_next_available_key to avoid holding lock during slow
  DB writes
- test_key_pool_integration.py: Replace tautological assertion in
  concurrent deadlock test with meaningful checks on result count

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RunManager.start_run() now uses KeyPool.get_next_key() to assign each
parallel worker a distinct API key from the pool. The LRU tiebreaker
in get_next_key() naturally distributes keys across workers spawned in
sequence. Falls back to the raw credential if the pool is empty.

Since all workers share the same SQLite DB, rate limit state written by
one worker's KeyPool is immediately visible to other workers — Worker B
will skip a key that Worker A just marked as rate-limited.

Adds 3 integration tests:
- test_parallel_workers_get_different_keys
- test_worker_rate_limit_visible_to_other_workers
- test_single_key_shared_across_workers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implements PRD Section 6 security requirement: rate limiting on key
management endpoints to prevent brute-force key extraction.

- New agent/rate_limit.py: sliding window RateLimiter class (~50 lines)
  with injectable time function for testability
- All 7 /keys/* routes in endpoints.py protected via FastAPI Depends
- All 7 /api/keys/* proxy routes in monitor/app.py also protected
  with a separate limiter instance (prevents bypass via monitor)
- Returns HTTP 429 with Retry-After header when limit exceeded
- 3 new tests: under threshold, at threshold (429), window reset

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Health endpoint reads from signals.current_run_id, not main._current_run_id.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…, worker detection, env cleanup

- key_pool.py: wrap delete_key in asyncio.Lock to prevent TOCTOU race deleting last enabled key
- run_manager.py: share single KeyPool instance across worker launches for proper LRU distribution
- run_manager.py: restore WORKER_MODE=1 env var for worker container detection
- endpoints.py: use module-level _key_pool singleton so asyncio.Lock actually protects state
- runner.py: clean up CLAUDE_CODE_OAUTH_TOKEN from os.environ in finally blocks
- permissions.py: block bare env/printenv commands that dump all environment variables

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant