feat(security): tool-call spike anomaly detector (T-AD1)#308
Open
Destynova2 wants to merge 1 commit into
Open
Conversation
Adds a sliding-window counter keyed by session id (with fallback to user id and tenant id) so that runaway Claude Code sessions cannot exhaust provider quotas. Two configurable thresholds: a warn level that emits a metric and a log line, and a block level that returns HTTP 429 plus a signed audit entry. Defaults are conservative: warn at 100 tool calls/min/session (matches a busy build run reading ~2 files/sec) and block at 500/min (equivalent to >8/sec sustained — only a runaway loop produces this). Implementation: - src/security/tool_spike.rs: new module. 60-bucket ring (1s each) for the rolling window; lazy bucket aging (no background task); saturating-add on every counter to handle malicious overflow. - src/security/mod.rs: register module, re-export public types. - src/security/audit_log.rs: new ToolSpikeBlocked AuditEvent variant. - src/cli/config/security.rs: tool_spike_warn_per_min (default 100) and tool_spike_block_per_min (default 500) on [security]. - src/server/init.rs: init_tool_spike_detector helper, returns None when both thresholds are zero (full disable). - src/server/mod.rs: SecurityState gains tool_spike_detector field. - src/server/error.rs: AppError::RateLimited maps to HTTP 429 with type=rate_limited. - src/server/dispatch/mod.rs: check_tool_spike runs in step 1.4, after DLP and before routing — DLP scope blocks still take precedence; the limiter cannot be bypassed by re-routing. Tests: - 11 unit tests in security::tool_spike covering allow / warn / block paths, the 60s decay boundary, partial-decay correctness across bucket rotation, key-resolution priority (session_id > user_id > tenant fallback), session reset on end-of-life, and idle-cleanup of stale keys. - AppError::RateLimited integration test: verifies HTTP 429 + type=rate_limited body. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 tasks
Destynova2
pushed a commit
that referenced
this pull request
May 15, 2026
Closes 3 multi-tenant isolation gaps flagged by PR #326's regression suite. The 3 #[ignore]'d tests now pass. * Per-tenant budget: SpendTracker grows check_tenant_budget(tenant,...) backed by per-tenant in-memory caches and per-tenant JSONL journals at spend/<tenant>/<YYYY-MM>.jsonl. record_tenant no longer also accumulates into the global counter, so a tenant overspend cannot trip the global budget for other tenants. * Per-tenant secrets: SecretBackend::get takes (tenant, name). All three backends (LocalEncrypted, Env, File) resolve <tenant>/<name> first and fall back to the legacy flat layout for global secrets so single-tenant deployments need no migration. * strict_tenant flag: new [security] strict_tenant config field plus tenant_required_middleware that returns HTTP 400 with a missing_tenant body when the flag is on and no tenant id can be resolved (X-Tenant-ID header, JWT claim, or virtual key). * Tenant id source priority: VirtualKeyContext > JWT claim > X-Tenant-ID header. Header is only consulted when no authenticated tenant exists, so a client cannot impersonate a JWT tenant. API breaking changes (all internal except SecretBackend trait): - SecretBackend::get: now takes &str tenant before &str name. - SpendTracker grows check_tenant_budget plus DEFAULT_TENANT bucket. - SpendTracker::record_tenant stops also calling self.record(...). Tests: 1345 nextest, fmt+clippy clean. The 3 previously #[ignore]'d multi_tenant_isolation tests now pass alongside the 4 already-passing ones (7 active, 1 still ignored for ToolSpikeDetector from PR #308). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds per-session tool-call spike anomaly detection so a misbehaving Claude Code session cannot exhaust provider quotas or trigger billing surprises.
session_id→user_id→tenant_id→\"anon\". Old samples drop out automatically; no background task.[security]:tool_spike_warn_per_min(default 100): logs + emitsgrob_tool_spike_warn_total.tool_spike_block_per_min(default 500): returns HTTP 429 (AppError::RateLimited), writes a signedToolSpikeBlockedaudit entry, and emitsgrob_tool_spike_blocked_total.0(the detector returnsNonefrominit_tool_spike_detectorand the dispatch step is a no-op).Integration point
Runs as step 1.4 in
dispatch::dispatch()— after DLP scanning (so DLP scope blocks still take precedence) and before routing (so a runaway client cannot exhaust provider quotas before the spike is observed).Files
src/security/tool_spike.rs(new): detector + bucket ring + helpers.src/security/mod.rs: register module + re-export.src/security/audit_log.rs: newToolSpikeBlockedAuditEventvariant.src/cli/config/security.rs:tool_spike_{warn,block}_per_minconfig fields.src/server/init.rs:init_tool_spike_detectorhelper.src/server/mod.rs:SecurityState.tool_spike_detector.src/server/error.rs:AppError::RateLimited→ HTTP 429 withtype=rate_limited.src/server/dispatch/mod.rs:check_tool_spike()dispatch step.Test plan
security::tool_spike::testscover:reset_session()clears counter on session-endcleanup_idle()drops keys idle > 60sCanonicalRequest(tool_use+tool_result)session_id>user_id> tenant fallback >\"anon\"AppError::RateLimitedintegration test verifies HTTP 429 +rate_limitedbody type.cargo clippy --tests --lib -- -D warningsclean.cargo fmt --checkclean.Notes for reviewers
commands::setup::writer::tests::test_w2_strip_fallback_removes_openrouter_and_mappingstest is pre-existing breakage onmainHEAD09fe074(theperf.tomlpreset no longer ships openrouter). Verified by stashing this PR's diff and running the test on stockmain: still fails. Out of scope for a single-purpose PR; will land via the existingfix/preset-mod-include-strwork.presets/{cheap,fast,local,medium}.toml) created locally to make the brokeninclude_str!references insrc/preset/mod.rscompile are NOT part of this PR's diff — that's also covered by the standalone preset fix branch.🤖 Generated with Claude Code