AI Red Teaming & Security Tools

A curated, practitioner-focused list of tools, frameworks, datasets, and resources for AI red teaming, adversarial ML, LLM security, and AI governance. Categorized by where each fits in a real workflow — not just by what it claims to do.

For a hands-on "where do I actually start" walkthrough mapped to an AI attack lifecycle, see TOOLING.md.

Links rot fast in this space. Each section is dated. Entries marked ⚠ deprecated are kept for context — don't build on them.

How to Use This List
1. LLM Red Teaming & Scanning
2. Prompt Injection & Jailbreak Research
3. Agentic AI & MCP Attack Surface
4. RAG & Vector Store Attacks
5. Multimodal Attacks
6. Adversarial Machine Learning (Classical)
7. Model Extraction & Privacy Attacks
8. Data & Model Poisoning / Backdoors
9. Supply Chain & Model File Scanning
10. Guardrails & Runtime Defenses
11. Evaluation Harnesses & Benchmarks
12. Bias, Fairness & Interpretability
13. MLOps / Deployment Security
14. Standards, Frameworks & Compliance
15. Incident & Vulnerability Databases
16. Playgrounds & CTFs
17. Bug Bounty & Disclosure Programs
18. Further Reading
Contributing
License

How to Use This List

If you're new to AI red teaming, a sensible starting path:

Read OWASP LLM Top 10 (2025) and MITRE ATLAS to get the threat model.
Run garak against a target LLM to see what automated scanning gets you.
Try PyRIT for orchestrated multi-turn attacks.
Walk through Gandalf (or any CTF in section 16) for hands-on prompt injection.
For applied work, read NIST AI 600-1 and NIST AI 100-2 to align findings to risk language stakeholders understand.

For deeper workflow guidance, see TOOLING.md.

1. LLM Red Teaming & Scanning

Automated and semi-automated tooling for testing LLM systems.

PyRIT — Microsoft's Python Risk Identification Tool. Orchestrated, multi-turn red teaming for generative AI. Probably the most full-featured open-source framework right now.
garak — NVIDIA's LLM vulnerability scanner. Probe-and-detector architecture, dozens of attack categories (encoding tricks, DAN variants, prompt leak, toxicity, RealToxicityPrompts, etc.). Think nmap for LLMs.
promptfoo — LLM evaluation + red teaming. Strong YAML-driven test harness, OWASP LLM Top 10 preset built in.
DeepEval — pytest-style LLM evaluation framework with red team modules.
Inspect AI — UK AISI's evaluation framework. Increasingly the standard for safety evals and dangerous-capability testing.
Mantis (verify before use — research project, lightly maintained) — Trail of Bits' framework for LLM adversarial testing
TextAttack — Adversarial attacks on NLP models (still useful for classifier-style targets; less relevant to large autoregressive models).
⚠ Counterfit — Microsoft archived this in 2023 in favor of PyRIT. Don't start here.

2. Prompt Injection & Jailbreak Research

Tools, payload collections, and research benches for prompt injection and jailbreaking specifically.

Promptmap2 — Automated prompt injection scanner for LLM apps.
L1B3RT4S — Plinius the Liberator's jailbreak prompt collection. Updated frequently; widely used as a corpus.
Many-shot jailbreaking — Anthropic research on long-context jailbreaks
Crescendo — Microsoft Research multi-turn jailbreak technique
Skeleton Key — Microsoft-disclosed jailbreak class
JailbreakBench — standardized jailbreak benchmark
HarmBench — automated red teaming benchmark from CAIS
AdvBench — GCG-style suffix attacks; companion to the original Zou et al. "universal and transferable adversarial attacks" paper
BurpGPT — Burp Suite extension that integrates LLM analysis into web testing flows

3. Agentic AI & MCP Attack Surface

Tools and benches for agent systems and Model Context Protocol — the hot 2025–2026 attack surface.

AgentDojo — Benchmark for agent prompt injection attacks (ETH Zurich)
InjecAgent — Indirect prompt injection benchmark for tool-using agents
Agent Security Bench (ASB) — Formal benchmark for LLM agent security
τ-bench — Tool-agent-user benchmark (Sierra)
mcp-scan (verify — fast-moving space) — Invariant Labs' scanner for MCP server risks
OWASP Top 10 for Agentic AI (2025) — threats and mitigations document
NIST AI RMF Agentic Profile (draft, CSA Labs) — extension to AI RMF for autonomous agents

4. RAG & Vector Store Attacks

PoisonedRAG — Knowledge corruption attacks against RAG pipelines
AgentPoison — Memory-poisoning attacks on agent RAG memory
ConfusedPilot — RAG-based Copilot attack class
See also: section 1 tools (garak, promptfoo) have RAG-specific probes/presets

5. Multimodal Attacks

MM-SafetyBench — Vision-language model safety benchmark
VLAttack — Visual-language adversarial attacks
Voice Jailbreak Attacks — Audio modality attacks
FigStep — Image-encoded prompt injection
Adversarial patch / image attacks — see ART in section 6 for the classical toolkit

6. Adversarial Machine Learning (Classical)

Pre-LLM-era adversarial example tooling. Still relevant for classifiers, vision, and recommender models.

Adversarial Robustness Toolbox (ART) — IBM/LF AI; the most comprehensive adversarial ML library. 39+ attacks, 29+ defenses.
Foolbox — Adversarial examples for PyTorch/TensorFlow/JAX models
CleverHans — Benchmarking adversarial defenses (note: moved to cleverhans-lab org)
SecML — ML security against adversarial / poisoning / evasion attacks

7. Model Extraction & Privacy Attacks

ML Privacy Meter — Membership inference, attribute inference, model inversion
Model Inversion Attack Toolbox — Reconstruct training data from model outputs
TensorFlow Privacy — Differential-privacy training + attack benchmarks
Opacus — DP-SGD for PyTorch
Training data extraction research — see Carlini et al. on extractable training data for current technique baselines

8. Data & Model Poisoning / Backdoors

BackdoorBench — Comprehensive backdoor attack/defense benchmark
TrojanZoo — Backdoor and adversarial robustness library
BadNets — Classic backdoor reference implementation
Snorkel (formerly recommended for "poisoning detection" — that framing was loose) — snorkel.org; data labeling tool, not strictly a security tool

9. Supply Chain & Model File Scanning

The pickle problem hasn't gone away; safetensors has, but compromised models still ship.

ModelScan — ProtectAI scanner for ML model serialization formats (pickle, H5, SavedModel, GGUF, etc.)
picklescan — Static analysis for malicious pickle files
Fickling — Trail of Bits' pickle decompiler and security analyzer
HuggingFace Picklescan integration — built-in scanning on the Hub
Stable Diffusion Pickle Scanner + GUI — SD ecosystem-specific
Guardian (commercial) — ProtectAI's enterprise model scanning
HiddenLayer Model Scanner (commercial) — model file scanning + attack telemetry

10. Guardrails & Runtime Defenses

For builders, but red teamers should know what they're testing against.

LLM Guard — ProtectAI's input/output scanner toolkit (was laiyer-ai/llm-guard — repo moved). 15 input + 20 output scanners.
NeMo Guardrails — NVIDIA's programmable guardrails (Colang DSL)
Rebuff — Prompt injection detection (heuristics + LLM classifier + vector DB + canary tokens)
Guardrails AI — Validation layer for LLM I/O
Vigil — LLM prompt injection scanner
LangKit — WhyLabs' LLM telemetry / monitoring with safety signals
Lakera Guard (commercial, free tier) — hosted prompt injection / data loss API
Llama Guard 3 / 4 — Meta's safety classifier models (open weights)

11. Evaluation Harnesses & Benchmarks

Red teaming and eval converge. Use these to baseline a model and measure post-mitigation deltas.

lm-evaluation-harness — EleutherAI's standard eval harness (200+ tasks)
HELM — Stanford CRFM's holistic evaluation
OpenAI Evals — Eval framework + registry
Inspect AI — UK AISI; cross-listed with section 1
CyberSecEval — Meta's cyber-risk benchmark for LLMs (insecure code, cyberattack helpfulness, etc.)
METR Task Suite — autonomy / dangerous-capability evals
SWE-bench — code-agent benchmark (relevant for evaluating code-writing capability that matters in security contexts)

12. Bias, Fairness & Interpretability

AI Fairness 360 (AIF360) — IBM; 70+ fairness metrics, 10 bias mitigation algorithms
Fairlearn — Microsoft fairness assessment + mitigation
Captum — PyTorch model interpretability
SHAP — Shapley value explanations
LIME — Local interpretable model-agnostic explanations
Transformer Circuits — Anthropic's interpretability research (reference, not tooling)

13. MLOps / Deployment Security

MLflow — Experiment tracking; check security advisories, several CVEs over 2023–2024
Trivy — Container & filesystem scanner; works on AI containers too
Kubescape — Kubernetes hardening
NB Defense — Jupyter notebook security scanner
Morpheus — NVIDIA cybersecurity AI pipeline (anomaly detection, etc.) — note: defensive/SOC framing, not red team

14. Standards, Frameworks & Compliance

Cross-cutting frameworks

MITRE ATLAS — Adversarial threat landscape for AI systems. Counterpart to ATT&CK; the threat model most red teamers map findings to.
OWASP Top 10 for LLM Applications (v2025) — Current version; covers prompt injection, supply chain, system prompt leakage, vector/embedding weaknesses, unbounded consumption, etc.
OWASP Top 10 for Agentic AI (2025) — Agent-specific risk taxonomy
OWASP ML Security Top 10 — Classical ML risks
OWASP AI Exchange — Comprehensive AI security & governance reference

NIST

NIST AI RMF 1.0 (AI 100-1) — Core AI risk management framework
NIST AI 600-1 — Generative AI Profile of the AI RMF (July 2024). 12 risk categories, 200+ suggested actions.
NIST AI 100-2 E2025 — Adversarial Machine Learning: A Taxonomy and Terminology
NIST SP 800-218A — Secure Software Development Practices for Generative AI / Dual-Use Foundation Models
Dioptra — NIST's testbed for assessing ML attack effects

International / Regulatory

EU AI Act — In force progressively from Feb 2025; risk-tiered obligations (unacceptable / high / limited / minimal)
ISO/IEC 42001:2023 — AI Management Systems standard
ISO/IEC 23894:2023 — AI risk management
ISO/IEC 27090 (in development) — Guidance on addressing security threats to AI systems
UK AI Safety Institute — Publications and frameworks
Singapore Model AI Governance Framework
GDPR & AI — Article 22 (automated decisions), and broader DPIA expectations
Korea AI Basic Act (2025) (verify current URL — recent legislation)

Vendor & Industry

Google Secure AI Framework (SAIF)
Microsoft Responsible AI Standard
Anthropic's Responsible Scaling Policy
Frontier Model Forum — Industry body output (Anthropic, Google, Microsoft, OpenAI)

Note on US executive actions: EO 14110 (Biden, 2023) was revoked in January 2025 and replaced by Executive Order 14179 ("Removing Barriers to American Leadership in Artificial Intelligence"). NIST AI 600-1 was published under EO 14110 but the document itself remains a NIST publication and is still in use. The US executive landscape is shifting; verify current EO and OMB guidance before relying on either.

15. Incident & Vulnerability Databases

AI Vulnerability Database (AVID) — Open knowledge base of failure modes in GPAI systems; pivoted in 2025–2026 to focus on agentic / system-level vulns. Maps to MITRE ATLAS, CVSS, and AVID's own taxonomy.
AI Incident Database (AIID) — Real-world AI incidents (Partnership on AI)
OECD AI Incidents Monitor — Tracks AI-related incidents internationally
MITRE ATLAS Case Studies — Real attacks mapped to the ATLAS matrix

16. Playgrounds & CTFs

For practice, training, and demonstrating attacks safely.

Gandalf — Lakera's prompt-injection CTF, 8 levels + extras. The classic starter.
Tensor Trust — Prompt injection / defense game (Berkeley)
Doublespeak — Jailbreak chat-style challenges
Spy Logic — Immersive Labs LLM challenges
PortSwigger Web Security Academy — LLM labs — Practical hands-on labs for LLM-integrated web apps
DEF CON AI Village Generative Red Team — Annual large-scale public red team event
HackTheBox AI-themed boxes — Various AI-themed challenges across HTB and HTB Academy
Prompt Airlines — Wiz's prompt injection CTF
MyLLMBank — LLM stress-test playground
Hugging Face Spaces — Host of community AI demos; many double as test targets

17. Bug Bounty & Disclosure Programs

0Din by Mozilla — Generative AI–focused bug bounty
Anthropic Responsible Disclosure / Bug Bounty — Public program covering Claude
OpenAI Bug Bounty — Covers OpenAI products/infra; note model-output issues have a separate process
Google VRP (AI scope) — AI vulnerabilities included in main VRP
Microsoft AI Bounty — Covers Copilot and AI products
Meta Bug Bounty — AI/Llama scope expanded 2024+
Hugging Face Security (disclosure, not paid bounty by default)
xAI Bug Bounty (verify current scope) — Grok and related
Bugcrowd AI Security — Platform with multiple AI-targeted programs
HackerOne AI Safety — Platform-hosted programs and challenges
Apple Security Bounty (AI scope) — Apple Intelligence is now in scope

18. Further Reading

Lilian Weng — Adversarial Attacks on LLMs — Survey-quality overview
Simon Willison's prompt injection writing — The single most prolific public commentator on the topic; start here
Anthropic's red teaming research — Particularly the constitutional classifiers and many-shot jailbreaking papers
OpenAI Safety Research
Google DeepMind safety publications
NVIDIA AI Red Team blog
AI Snake Oil — Useful counterweight to hype; critical thinking on AI claims
Embrace the Red (Johann Rehberger) — Practical prompt injection / agent exploit research

Contributing

PRs welcome. Useful contributions:

New tools that meaningfully change a workflow (not just another wrapper)
Replacements for deprecated tools
Updates to standards / frameworks
Broken-link fixes (please verify before submitting)

Please keep entries tight, cite primary sources, and note if something is research-quality vs production-ready vs commercial.

License

MIT — use freely, attribution appreciated.

This is a living document. The AI security field moves quickly; expect quarterly-ish refreshes.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
AI Red Teaming and Security Tools Repository		AI Red Teaming and Security Tools Repository
MIT License		MIT License
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AI Red Teaming & Security Tools

Table of Contents

How to Use This List

1. LLM Red Teaming & Scanning

2. Prompt Injection & Jailbreak Research

3. Agentic AI & MCP Attack Surface

4. RAG & Vector Store Attacks

5. Multimodal Attacks

6. Adversarial Machine Learning (Classical)

7. Model Extraction & Privacy Attacks

8. Data & Model Poisoning / Backdoors

9. Supply Chain & Model File Scanning

10. Guardrails & Runtime Defenses

11. Evaluation Harnesses & Benchmarks

12. Bias, Fairness & Interpretability

13. MLOps / Deployment Security

14. Standards, Frameworks & Compliance

15. Incident & Vulnerability Databases

16. Playgrounds & CTFs

17. Bug Bounty & Disclosure Programs

18. Further Reading

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

AI Red Teaming & Security Tools

Table of Contents

How to Use This List

1. LLM Red Teaming & Scanning

2. Prompt Injection & Jailbreak Research

3. Agentic AI & MCP Attack Surface

4. RAG & Vector Store Attacks

5. Multimodal Attacks

6. Adversarial Machine Learning (Classical)

7. Model Extraction & Privacy Attacks

8. Data & Model Poisoning / Backdoors

9. Supply Chain & Model File Scanning

10. Guardrails & Runtime Defenses

11. Evaluation Harnesses & Benchmarks

12. Bias, Fairness & Interpretability

13. MLOps / Deployment Security

14. Standards, Frameworks & Compliance

15. Incident & Vulnerability Databases

16. Playgrounds & CTFs

17. Bug Bounty & Disclosure Programs

18. Further Reading

Contributing

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages