A curated, practitioner-focused list of tools, frameworks, datasets, and resources for AI red teaming, adversarial ML, LLM security, and AI governance. Categorized by where each fits in a real workflow — not just by what it claims to do.
For a hands-on "where do I actually start" walkthrough mapped to an AI attack lifecycle, see TOOLING.md.
Links rot fast in this space. Each section is dated. Entries marked ⚠ deprecated are kept for context — don't build on them.
- How to Use This List
- 1. LLM Red Teaming & Scanning
- 2. Prompt Injection & Jailbreak Research
- 3. Agentic AI & MCP Attack Surface
- 4. RAG & Vector Store Attacks
- 5. Multimodal Attacks
- 6. Adversarial Machine Learning (Classical)
- 7. Model Extraction & Privacy Attacks
- 8. Data & Model Poisoning / Backdoors
- 9. Supply Chain & Model File Scanning
- 10. Guardrails & Runtime Defenses
- 11. Evaluation Harnesses & Benchmarks
- 12. Bias, Fairness & Interpretability
- 13. MLOps / Deployment Security
- 14. Standards, Frameworks & Compliance
- 15. Incident & Vulnerability Databases
- 16. Playgrounds & CTFs
- 17. Bug Bounty & Disclosure Programs
- 18. Further Reading
- Contributing
- License
If you're new to AI red teaming, a sensible starting path:
- Read OWASP LLM Top 10 (2025) and MITRE ATLAS to get the threat model.
- Run garak against a target LLM to see what automated scanning gets you.
- Try PyRIT for orchestrated multi-turn attacks.
- Walk through Gandalf (or any CTF in section 16) for hands-on prompt injection.
- For applied work, read NIST AI 600-1 and NIST AI 100-2 to align findings to risk language stakeholders understand.
For deeper workflow guidance, see TOOLING.md.
Automated and semi-automated tooling for testing LLM systems.
- PyRIT — Microsoft's Python Risk Identification Tool. Orchestrated, multi-turn red teaming for generative AI. Probably the most full-featured open-source framework right now.
- garak — NVIDIA's LLM vulnerability scanner. Probe-and-detector architecture, dozens of attack categories (encoding tricks, DAN variants, prompt leak, toxicity, RealToxicityPrompts, etc.). Think
nmapfor LLMs. - promptfoo — LLM evaluation + red teaming. Strong YAML-driven test harness, OWASP LLM Top 10 preset built in.
- DeepEval — pytest-style LLM evaluation framework with red team modules.
- Inspect AI — UK AISI's evaluation framework. Increasingly the standard for safety evals and dangerous-capability testing.
- Mantis (verify before use — research project, lightly maintained) — Trail of Bits' framework for LLM adversarial testing
- TextAttack — Adversarial attacks on NLP models (still useful for classifier-style targets; less relevant to large autoregressive models).
- ⚠ Counterfit — Microsoft archived this in 2023 in favor of PyRIT. Don't start here.
Tools, payload collections, and research benches for prompt injection and jailbreaking specifically.
- Promptmap2 — Automated prompt injection scanner for LLM apps.
- L1B3RT4S — Plinius the Liberator's jailbreak prompt collection. Updated frequently; widely used as a corpus.
- Many-shot jailbreaking — Anthropic research on long-context jailbreaks
- Crescendo — Microsoft Research multi-turn jailbreak technique
- Skeleton Key — Microsoft-disclosed jailbreak class
- JailbreakBench — standardized jailbreak benchmark
- HarmBench — automated red teaming benchmark from CAIS
- AdvBench — GCG-style suffix attacks; companion to the original Zou et al. "universal and transferable adversarial attacks" paper
- BurpGPT — Burp Suite extension that integrates LLM analysis into web testing flows
Tools and benches for agent systems and Model Context Protocol — the hot 2025–2026 attack surface.
- AgentDojo — Benchmark for agent prompt injection attacks (ETH Zurich)
- InjecAgent — Indirect prompt injection benchmark for tool-using agents
- Agent Security Bench (ASB) — Formal benchmark for LLM agent security
- τ-bench — Tool-agent-user benchmark (Sierra)
- mcp-scan (verify — fast-moving space) — Invariant Labs' scanner for MCP server risks
- OWASP Top 10 for Agentic AI (2025) — threats and mitigations document
- NIST AI RMF Agentic Profile (draft, CSA Labs) — extension to AI RMF for autonomous agents
- PoisonedRAG — Knowledge corruption attacks against RAG pipelines
- AgentPoison — Memory-poisoning attacks on agent RAG memory
- ConfusedPilot — RAG-based Copilot attack class
- See also: section 1 tools (garak, promptfoo) have RAG-specific probes/presets
- MM-SafetyBench — Vision-language model safety benchmark
- VLAttack — Visual-language adversarial attacks
- Voice Jailbreak Attacks — Audio modality attacks
- FigStep — Image-encoded prompt injection
- Adversarial patch / image attacks — see ART in section 6 for the classical toolkit
Pre-LLM-era adversarial example tooling. Still relevant for classifiers, vision, and recommender models.
- Adversarial Robustness Toolbox (ART) — IBM/LF AI; the most comprehensive adversarial ML library. 39+ attacks, 29+ defenses.
- Foolbox — Adversarial examples for PyTorch/TensorFlow/JAX models
- CleverHans — Benchmarking adversarial defenses (note: moved to
cleverhans-laborg) - SecML — ML security against adversarial / poisoning / evasion attacks
- ML Privacy Meter — Membership inference, attribute inference, model inversion
- Model Inversion Attack Toolbox — Reconstruct training data from model outputs
- TensorFlow Privacy — Differential-privacy training + attack benchmarks
- Opacus — DP-SGD for PyTorch
- Training data extraction research — see Carlini et al. on extractable training data for current technique baselines
- BackdoorBench — Comprehensive backdoor attack/defense benchmark
- TrojanZoo — Backdoor and adversarial robustness library
- BadNets — Classic backdoor reference implementation
- Snorkel (formerly recommended for "poisoning detection" — that framing was loose) —
snorkel.org; data labeling tool, not strictly a security tool
The pickle problem hasn't gone away; safetensors has, but compromised models still ship.
- ModelScan — ProtectAI scanner for ML model serialization formats (pickle, H5, SavedModel, GGUF, etc.)
- picklescan — Static analysis for malicious pickle files
- Fickling — Trail of Bits' pickle decompiler and security analyzer
- HuggingFace Picklescan integration — built-in scanning on the Hub
- Stable Diffusion Pickle Scanner + GUI — SD ecosystem-specific
- Guardian (commercial) — ProtectAI's enterprise model scanning
- HiddenLayer Model Scanner (commercial) — model file scanning + attack telemetry
For builders, but red teamers should know what they're testing against.
- LLM Guard — ProtectAI's input/output scanner toolkit (was
laiyer-ai/llm-guard— repo moved). 15 input + 20 output scanners. - NeMo Guardrails — NVIDIA's programmable guardrails (Colang DSL)
- Rebuff — Prompt injection detection (heuristics + LLM classifier + vector DB + canary tokens)
- Guardrails AI — Validation layer for LLM I/O
- Vigil — LLM prompt injection scanner
- LangKit — WhyLabs' LLM telemetry / monitoring with safety signals
- Lakera Guard (commercial, free tier) — hosted prompt injection / data loss API
- Llama Guard 3 / 4 — Meta's safety classifier models (open weights)
Red teaming and eval converge. Use these to baseline a model and measure post-mitigation deltas.
- lm-evaluation-harness — EleutherAI's standard eval harness (200+ tasks)
- HELM — Stanford CRFM's holistic evaluation
- OpenAI Evals — Eval framework + registry
- Inspect AI — UK AISI; cross-listed with section 1
- CyberSecEval — Meta's cyber-risk benchmark for LLMs (insecure code, cyberattack helpfulness, etc.)
- METR Task Suite — autonomy / dangerous-capability evals
- SWE-bench — code-agent benchmark (relevant for evaluating code-writing capability that matters in security contexts)
- AI Fairness 360 (AIF360) — IBM; 70+ fairness metrics, 10 bias mitigation algorithms
- Fairlearn — Microsoft fairness assessment + mitigation
- Captum — PyTorch model interpretability
- SHAP — Shapley value explanations
- LIME — Local interpretable model-agnostic explanations
- Transformer Circuits — Anthropic's interpretability research (reference, not tooling)
- MLflow — Experiment tracking; check security advisories, several CVEs over 2023–2024
- Trivy — Container & filesystem scanner; works on AI containers too
- Kubescape — Kubernetes hardening
- NB Defense — Jupyter notebook security scanner
- Morpheus — NVIDIA cybersecurity AI pipeline (anomaly detection, etc.) — note: defensive/SOC framing, not red team
Cross-cutting frameworks
- MITRE ATLAS — Adversarial threat landscape for AI systems. Counterpart to ATT&CK; the threat model most red teamers map findings to.
- OWASP Top 10 for LLM Applications (v2025) — Current version; covers prompt injection, supply chain, system prompt leakage, vector/embedding weaknesses, unbounded consumption, etc.
- OWASP Top 10 for Agentic AI (2025) — Agent-specific risk taxonomy
- OWASP ML Security Top 10 — Classical ML risks
- OWASP AI Exchange — Comprehensive AI security & governance reference
NIST
- NIST AI RMF 1.0 (AI 100-1) — Core AI risk management framework
- NIST AI 600-1 — Generative AI Profile of the AI RMF (July 2024). 12 risk categories, 200+ suggested actions.
- NIST AI 100-2 E2025 — Adversarial Machine Learning: A Taxonomy and Terminology
- NIST SP 800-218A — Secure Software Development Practices for Generative AI / Dual-Use Foundation Models
- Dioptra — NIST's testbed for assessing ML attack effects
International / Regulatory
- EU AI Act — In force progressively from Feb 2025; risk-tiered obligations (unacceptable / high / limited / minimal)
- ISO/IEC 42001:2023 — AI Management Systems standard
- ISO/IEC 23894:2023 — AI risk management
- ISO/IEC 27090 (in development) — Guidance on addressing security threats to AI systems
- UK AI Safety Institute — Publications and frameworks
- Singapore Model AI Governance Framework
- GDPR & AI — Article 22 (automated decisions), and broader DPIA expectations
- Korea AI Basic Act (2025) (verify current URL — recent legislation)
Vendor & Industry
- Google Secure AI Framework (SAIF)
- Microsoft Responsible AI Standard
- Anthropic's Responsible Scaling Policy
- Frontier Model Forum — Industry body output (Anthropic, Google, Microsoft, OpenAI)
Note on US executive actions: EO 14110 (Biden, 2023) was revoked in January 2025 and replaced by Executive Order 14179 ("Removing Barriers to American Leadership in Artificial Intelligence"). NIST AI 600-1 was published under EO 14110 but the document itself remains a NIST publication and is still in use. The US executive landscape is shifting; verify current EO and OMB guidance before relying on either.
- AI Vulnerability Database (AVID) — Open knowledge base of failure modes in GPAI systems; pivoted in 2025–2026 to focus on agentic / system-level vulns. Maps to MITRE ATLAS, CVSS, and AVID's own taxonomy.
- AI Incident Database (AIID) — Real-world AI incidents (Partnership on AI)
- OECD AI Incidents Monitor — Tracks AI-related incidents internationally
- MITRE ATLAS Case Studies — Real attacks mapped to the ATLAS matrix
For practice, training, and demonstrating attacks safely.
- Gandalf — Lakera's prompt-injection CTF, 8 levels + extras. The classic starter.
- Tensor Trust — Prompt injection / defense game (Berkeley)
- Doublespeak — Jailbreak chat-style challenges
- Spy Logic — Immersive Labs LLM challenges
- PortSwigger Web Security Academy — LLM labs — Practical hands-on labs for LLM-integrated web apps
- DEF CON AI Village Generative Red Team — Annual large-scale public red team event
- HackTheBox AI-themed boxes — Various AI-themed challenges across HTB and HTB Academy
- Prompt Airlines — Wiz's prompt injection CTF
- MyLLMBank — LLM stress-test playground
- Hugging Face Spaces — Host of community AI demos; many double as test targets
- 0Din by Mozilla — Generative AI–focused bug bounty
- Anthropic Responsible Disclosure / Bug Bounty — Public program covering Claude
- OpenAI Bug Bounty — Covers OpenAI products/infra; note model-output issues have a separate process
- Google VRP (AI scope) — AI vulnerabilities included in main VRP
- Microsoft AI Bounty — Covers Copilot and AI products
- Meta Bug Bounty — AI/Llama scope expanded 2024+
- Hugging Face Security (disclosure, not paid bounty by default)
- xAI Bug Bounty (verify current scope) — Grok and related
- Bugcrowd AI Security — Platform with multiple AI-targeted programs
- HackerOne AI Safety — Platform-hosted programs and challenges
- Apple Security Bounty (AI scope) — Apple Intelligence is now in scope
- Lilian Weng — Adversarial Attacks on LLMs — Survey-quality overview
- Simon Willison's prompt injection writing — The single most prolific public commentator on the topic; start here
- Anthropic's red teaming research — Particularly the constitutional classifiers and many-shot jailbreaking papers
- OpenAI Safety Research
- Google DeepMind safety publications
- NVIDIA AI Red Team blog
- AI Snake Oil — Useful counterweight to hype; critical thinking on AI claims
- Embrace the Red (Johann Rehberger) — Practical prompt injection / agent exploit research
PRs welcome. Useful contributions:
- New tools that meaningfully change a workflow (not just another wrapper)
- Replacements for deprecated tools
- Updates to standards / frameworks
- Broken-link fixes (please verify before submitting)
Please keep entries tight, cite primary sources, and note if something is research-quality vs production-ready vs commercial.
MIT — use freely, attribution appreciated.
This is a living document. The AI security field moves quickly; expect quarterly-ish refreshes.