feat: multi-key rotation & intelligent rate-limit recovery by heemzers · Pull Request #65 · SignalPilot-Labs/SignalPilot

heemzers · 2026-04-03T17:12:17Z

Summary

Multi-key pool: rotate across multiple Claude Code OAuth keys automatically on rate limit, zero downtime when a second key is available.

Key changes

Key pool and rotation (agent/key_pool.py): encrypted key storage, priority-based selection, LRU distribution, auto-wait for rate limit recovery
Auto-wait and resume: when all keys exhausted, agent sleeps until earliest key resets
Optional Codex fallback: off by default
Full observability: 8 new audit event types
Security: keys encrypted at rest (Fernet), masked in API responses, env cleanup
Rate limiting: 10 req/min on /keys/* endpoints
RunManager integration: parallel workers get distinct keys via shared KeyPool
Backward compatible: single-token auto-migration

Review fixes (`c039061`)

TOCTOU race in delete_key wrapped in asyncio.Lock
Shared KeyPool singleton in RunManager
Restored WORKER_MODE=1 env var
Module-level KeyPool singleton in endpoints
os.environ cleanup in finally blocks

Pre-Landing Review

5 critical issues found and fixed. 7 informational findings logged.

Test plan

552/552 tests pass
Verification gate passed after review fix commit

Covers 9 scenarios: initialization with run_id, single-token migration, key rotation on rate limit, exhausted pool returning None, auto-wait with mocked sleep, auto-wait disabled, max-wait exceeded, pool status counts, and audit event logging for key_rate_limited / key_rotated / key_pool_waiting / key_pool_resumed events. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Covers POST /keys (add, encrypt, mask, validation), GET /keys (list, mask, status fields), PATCH /keys/{id} (partial update, not-found), DELETE /keys/{id} (removes key, 409 for last enabled, 404 for missing), GET /keys/status (pool structure, counts, masked keys), GET /keys/config (defaults), and documents the PATCH /keys/config routing bug where FastAPI resolves "config" as a key_id before the dedicated config handler is reached. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…it recovery Covers database CRUD (add/list/update/delete with encryption), key selection (priority, LRU, disabled/rate-limited skipping, provider isolation), rotation logic (ping-pong, codex fallback, audit logging), config defaults/updates, one-time migration from settings.claude_token, and auto-wait behaviour with mocked asyncio.sleep. Tests use real temporary SQLite databases via agent.db.init_db and a temporary Fernet master key; key_pool.MASTER_KEY_PATH and monitor.crypto._fernet are patched per-test to ensure isolation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add a key pool system that rotates across multiple Claude Code OAuth keys on rate limit, with optional Codex fallback and auto-wait for cooldown recovery. This eliminates manual intervention when rate limits are hit. New files: - key_pool.py: KeyPool class with priority-based selection, rotation, auto-wait, config management, and single-token migration - codex_client.py: Thin OpenAI Codex API adapter (off by default) - test_key_pool.py: 58 unit tests for key pool logic - test_key_pool_integration.py: 9 integration tests - test_codex_client.py: 16 unit tests for Codex adapter - test_key_endpoints.py: 34 endpoint tests Modified files: - db.py: Add api_keys + key_rotation_config tables - runner.py: Replace rate limit handler with key rotation, add SDK client restart loop for seamless key swaps - endpoints.py: Add /keys/* CRUD and /keys/config management routes - monitor/app.py: Add proxy routes + matching schema tables - permissions.py: Block agent reads of key pool data 602/603 tests pass (1 pre-existing failure unrelated to this change). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The runner was calling mark_rate_limited() explicitly then calling handle_rate_limit() which also calls it internally. This caused the resets_at/utilization to be overwritten with None on the second call. Also removed duplicate audit log entries — key_pool.py is now the single source of truth for key rotation and fallback audit events. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…eding - Re-fetch key from DB after auto-wait sleep to avoid stale state - Add signals.has_pending_signals() public method instead of accessing _signal_queue internals from runner.py - Move config seeding flag to instance level to avoid test pollution - Log decrypt failures in list_keys instead of silently masking Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…l_requests, wire Codex mode - Add asyncio.Lock to KeyPool for task-safe concurrent access - Add __repr__ to CodexClient that masks API key (prevents leaks in tracebacks) - Rename misleading total_requests to rate_limit_hits (only incremented on limit) - Wire Codex degraded mode into runner: polls for Claude key recovery every 60s instead of falling through as a no-op - Add 3 new tests for CodexClient repr masking Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…gration tests - Fix monitor proxy route ordering: /api/keys/status and /api/keys/config must be registered before /api/keys/{key_id} to avoid matching "status" and "config" as key_id parameters (same fix applied earlier to agent endpoints) - Add DB migration for total_requests -> rate_limit_hits column rename (handles databases created during intermediate testing) - Add 3 integration tests: codex fallback enabled, codex fallback disabled, concurrent handle_rate_limit no-deadlock Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…handling 4 tests that exercise _run_loop directly with mocked SDK client: - Key rotation on rate limit (switches to second key, restarts client) - Pause when all keys exhausted (auto_wait disabled) - Codex fallback polls for Claude key recovery - Legacy single-key path (no key pool) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ock docs - Remove unnecessary CodexClient instantiation in runner (kept decrypted key in memory for entire polling loop with no usage) - Check signals before sleeping in Codex degraded mode (was sleeping 60s before first signal check, now checks immediately) - Sleep in 10s chunks instead of 60s for responsive signal handling - Log audit event when Codex fallback ends due to signal - Add concurrency safety notes to mark_rate_limited/clear_rate_limit docs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…king Addresses security audit findings: - M1: Add allow-list validation in update_config() — rejects any config key not in DEFAULT_CONFIG, preventing arbitrary key injection - M4: Block Bash access to /data/ directory (master key, databases) in permissions.py — covers cat, head, tail, less, cp, xxd, base64 - Add test for unknown config key rejection Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Addresses security audit M1: the monitor proxy routes used `body: dict` which bypassed Pydantic validation, allowing arbitrary JSON to flow through to the agent. Now uses typed models: AddKeyProxyRequest, UpdateKeyProxyRequest, RotationConfigProxyUpdate — matching the agent endpoint models and providing input validation at the proxy layer. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Verifies that arbitrary key-value pairs in update_config are rejected at the KeyPool level, preventing injection into key_rotation_config table. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Logs a key_pool_exhausted event with earliest_reset, wait_minutes, and active_key_id when all keys are rate-limited and auto-wait is exhausted or disabled. Completes the full set of 8 audit event types specified in the PRD. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Verifies that the run_id is preserved across key rotations and that audit events are logged against the correct run_id. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…tological assertion - runner.py: Track actual sleep duration instead of hardcoded 60s in codex polling loop, preventing waited counter from exceeding real elapsed time - key_pool.py: Move audit logging and status updates outside asyncio.Lock scope in wait_for_next_available_key to avoid holding lock during slow DB writes - test_key_pool_integration.py: Replace tautological assertion in concurrent deadlock test with meaningful checks on result count Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

RunManager.start_run() now uses KeyPool.get_next_key() to assign each parallel worker a distinct API key from the pool. The LRU tiebreaker in get_next_key() naturally distributes keys across workers spawned in sequence. Falls back to the raw credential if the pool is empty. Since all workers share the same SQLite DB, rate limit state written by one worker's KeyPool is immediately visible to other workers — Worker B will skip a key that Worker A just marked as rate-limited. Adds 3 integration tests: - test_parallel_workers_get_different_keys - test_worker_rate_limit_visible_to_other_workers - test_single_key_shared_across_workers Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Implements PRD Section 6 security requirement: rate limiting on key management endpoints to prevent brute-force key extraction. - New agent/rate_limit.py: sliding window RateLimiter class (~50 lines) with injectable time function for testability - All 7 /keys/* routes in endpoints.py protected via FastAPI Depends - All 7 /api/keys/* proxy routes in monitor/app.py also protected with a separate limiter instance (prevents bypass via monitor) - Returns HTTP 429 with Retry-After header when limit exceeded - 3 new tests: under threshold, at threshold (429), window reset Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Health endpoint reads from signals.current_run_id, not main._current_run_id. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…, worker detection, env cleanup - key_pool.py: wrap delete_key in asyncio.Lock to prevent TOCTOU race deleting last enabled key - run_manager.py: share single KeyPool instance across worker launches for proper LRU distribution - run_manager.py: restore WORKER_MODE=1 env var for worker container detection - endpoints.py: use module-level _key_pool singleton so asyncio.Lock actually protects state - runner.py: clean up CLAUDE_CODE_OAUTH_TOKEN from os.environ in finally blocks - permissions.py: block bare env/printenv commands that dump all environment variables Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Self-Improve Bot and others added 21 commits April 3, 2026 16:09

test: add endpoint-level test for unknown config key rejection

1a776a9

Verifies that arbitrary key-value pairs in update_config are rejected at the KeyPool level, preventing injection into key_rotation_config table. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

test: add test_rotation_preserves_run_id for key rotation audit

2766487

Verifies that the run_id is preserved across key rotations and that audit events are logged against the correct run_id. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: patch correct mock target in test_health_idle

d1dce53

Health endpoint reads from signals.current_run_id, not main._current_run_id. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into feat/multi-key-rotation

ff8c00a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: multi-key rotation & intelligent rate-limit recovery#65

feat: multi-key rotation & intelligent rate-limit recovery#65
heemzers wants to merge 21 commits intomainfrom
feat/multi-key-rotation

heemzers commented Apr 3, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

heemzers commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key changes

Review fixes (c039061)

Pre-Landing Review

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

heemzers commented Apr 3, 2026 •

edited

Loading

Review fixes (`c039061`)