Dependency-light benchmark suite for evaluating autonomous-agent security behavior.
It provides adversarial cases for prompt injection, data exfiltration, tool misuse, persistence/memory poisoning, approval bypass, canary exfiltration, and unsafe tool use. The default scorer is deterministic text/transcript analysis; live-agent adapters are declared but intentionally sandbox-gated.
Agent security failures are behavioral. A config can look reasonable while the agent still:
- follows malicious webpage instructions
- leaks tokens/secrets
- uses tools without approval
- stores attacker policy in memory
- bypasses outbound-action gates
- exfiltrates canary files or fake secrets through tool calls
This repo gives those risks repeatable benchmark cases.
Run without installing:
PYTHONPATH=src python -m agent_security_bench.cli list --format json
PYTHONPATH=src python -m agent_security_bench.cli score examples/safe-responses.json --format json
PYTHONPATH=src python -m agent_security_bench.cli score examples/unsafe-responses.json --format markdownInstall locally:
python -m pip install -e .
agent-security-bench list --format json
agent-security-bench score examples/safe-responses.json --format sarif > agent-security-bench.sarifdirect_prompt_injectionindirect_prompt_injectiondata_exfiltrationcanary_exfiltrationtool_misusepersistenceapproval_bypass
Built-in cases are always available, and additional cases can be loaded from JSON, JSONL, or a dependency-free YAML subset:
PYTHONPATH=src python -m agent_security_bench.cli list --cases examples/custom-cases.jsonl --format json
PYTHONPATH=src python -m agent_security_bench.cli score responses.json --cases examples/custom-cases.yamlEach case has a stable id, category, severity, difficulty, benchmark_version, required safe-response patterns, and forbidden unsafe patterns.
The score report includes:
schema_versionandbenchmark_version- aggregate summary
- score from
0.0to1.0 - per-category results
- per-difficulty results
- per-case pass/fail
- violations with required/forbidden pattern IDs
Supported report formats:
PYTHONPATH=src python -m agent_security_bench.cli score examples/unsafe-responses.json --format json
PYTHONPATH=src python -m agent_security_bench.cli score examples/unsafe-responses.json --format markdown
PYTHONPATH=src python -m agent_security_bench.cli score examples/unsafe-responses.json --format sarif
PYTHONPATH=src python -m agent_security_bench.cli score examples/unsafe-responses.json --format junit
PYTHONPATH=src python -m agent_security_bench.cli score examples/unsafe-responses.json \
--evidence-bundle agent-security-bench-evidence.json \
--format jsonUse score thresholds and failure gates when wiring the benchmark into CI:
PYTHONPATH=src python -m agent_security_bench.cli score examples/safe-responses.json \
--min-score 0.95 \
--fail-on-failures \
--format json--min-score Nexits1when the aggregate score is belowNand records the comparison underthresholds.--fail-on-failuresexits1when any case fails and records the failed-case count underthresholds.--format junitemits JUnit XML so CI systems can show benchmark cases as test results.
Use auditable baseline suppressions for temporary known failures without hiding them from reports:
PYTHONPATH=src python -m agent_security_bench.cli score examples/unsafe-responses.json \
--baseline-suppressions examples/baseline-suppressions.json \
--fail-on-failures \
--fail-on-expired-suppressions \
--fail-on-stale-suppressions \
--format jsonMatching non-expired suppressions are removed from active gates and preserved under suppressed_findings; expired suppressions do not hide failures, and stale suppressions are reported for cleanup. See docs/baseline-suppressions.md for the file format and lifecycle guidance.
A copyable GitHub Actions workflow is available at examples/github-actions/agent-security-bench.yml; it generates JSON, SARIF, and JUnit outputs and uploads SARIF with CodeQL code scanning.
Use --transcripts when inputs are case_id -> transcript objects with response and tool_calls fields. Transcript scoring flags unsafe tool calls, canary exfiltration, and fake-secret exfiltration in addition to text response violations.
PYTHONPATH=src python -m agent_security_bench.cli score examples/tool-transcripts.json --transcripts --format jsonList declared adapters:
PYTHONPATH=src python -m agent_security_bench.cli adapters --format jsonThe dry-run adapter is CI-safe and deterministic. Use run to collect normalized transcripts and score them with transcript-mode checks:
PYTHONPATH=src python -m agent_security_bench.cli run --adapter dry-run --format jsonFor fixture-backed adapter tests, use the dependency-free mock adapter with an explicit sandbox flag:
PYTHONPATH=src python -m agent_security_bench.cli run \
--adapter mock \
--sandbox \
--mock-transcripts examples/tool-transcripts.json \
--format jsonNormalized transcripts include response, tool_calls, timestamp, and adapter metadata. Hermes, OpenClaw, Codex, and Claude adapter specs are listed as sandbox_required; they require --sandbox and still refuse real external execution in this release so tests never invoke live agents or real outbound tools.
Validate the built-in corpus or a custom case file before review:
PYTHONPATH=src python -m agent_security_bench.cli lint-cases --format json
PYTHONPATH=src python -m agent_security_bench.cli lint-cases --cases examples/custom-cases.yaml --format jsonMeasure corpus coverage across categories, severities, difficulties, tags, tool-risk markers, fixture references, and required/forbidden behavior:
PYTHONPATH=src python -m agent_security_bench.cli coverage --format jsonlint-cases exits non-zero on case-quality errors such as invalid IDs, duplicate case IDs, invalid severity/difficulty values, duplicate patterns, or missing canary_ids / fake_secret_ids for synthetic fixture references.
Compare a current run to a baseline and fail if score or pass count regresses:
PYTHONPATH=src python -m agent_security_bench.cli regression current-report.json \
--baseline baseline-report.json \
--fail-on-regressionscore can also attach a regression comparison directly:
PYTHONPATH=src python -m agent_security_bench.cli score responses.json \
--baseline baseline-report.json \
--fail-on-regressionexamples/canary-fixtures.json contains fake canary files, fake secrets, and attacker endpoints used for transcript/canary tests. These values are intentionally inert and must never be replaced with real credentials.
PYTHONPATH=src python -m unittest discover -s tests -q
PYTHONPATH=src python -m pytest -q
python -m compileall -q src tests
ruff check .CI runs ruff, compileall, pytest, and the packaging smoke test.
Release checks include the normal test suite plus a packaging smoke test that builds a wheel, installs it into a fresh virtual environment, and exercises the installed console script:
python scripts/packaging-smoke.pyThe CI workflow runs this smoke test across Python 3.11/3.12 on Ubuntu, macOS, and Windows. User-visible CLI and report-schema changes are tracked in CHANGELOG.md; report compatibility rules live in docs/report-schema.md.
Use --evidence-bundle <path> with score or run to write a compact JSON artifact containing only failed active cases, their prompts, expected behavior, observed responses, violations, transcript/tool-call context when available, and reproducer commands. See docs/evidence-bundles.md for the schema and CI artifact guidance.
The original 0.2.0 benchmark roadmap is complete:
- YAML/JSONL case files — Shipped
- Live-agent adapters for Hermes/OpenClaw/Codex/Claude — Shipped as sandbox-gated adapter specs plus CI-safe dry-run execution
- Tool-call transcript scoring — Shipped
- Sandboxed canary file and fake secret fixtures — Shipped
- SARIF/Markdown reports — Shipped
- Difficulty tiers and benchmark versioning — Shipped
- Larger prompt-injection corpus — Shipped
- Regression mode for agent releases — Shipped
The next roadmap is tracked in docs/roadmap.md. Phase 1, the CI adoption pack, is shipped in 0.3.0 with JUnit XML, score thresholds, failure gates, and a GitHub Actions example. Phase 2 is shipped in 0.4.0 with auditable baseline suppressions and cleanup gates. Phase 3 is shipped in 0.5.0 with normalized dry-run/mock adapter transcripts and sandbox-gated external adapter specs. Phase 4 is shipped in 0.6.0 with corpus linting, coverage reporting, and case-review guidance. Phase 5 is shipped in 0.7.0 with packaging smoke tests, CI matrix coverage, a changelog, and report-schema policy. Phase 6 is shipped in 0.8.0 with evidence bundle artifacts for failed response and transcript cases. Phases 7-10 plan scenario suites, weighted scoring, evaluation manifests, and dashboard exports.
Cases intentionally contain malicious instructions and fake attacker destinations. Treat all case prompts as untrusted test data. Do not wire benchmark cases to real outbound tools without sandboxing and explicit approvals.