Skip to content

byoniq/AI-Redteaming

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 

Repository files navigation

AI Red Teaming & Security Tools

License Last Verified PRs

A curated, practitioner-focused list of tools, frameworks, datasets, and resources for AI red teaming, adversarial ML, LLM security, and AI governance. Categorized by where each fits in a real workflow — not just by what it claims to do.

For a hands-on "where do I actually start" walkthrough mapped to an AI attack lifecycle, see TOOLING.md.

Links rot fast in this space. Each section is dated. Entries marked ⚠ deprecated are kept for context — don't build on them.


Table of Contents


How to Use This List

If you're new to AI red teaming, a sensible starting path:

  1. Read OWASP LLM Top 10 (2025) and MITRE ATLAS to get the threat model.
  2. Run garak against a target LLM to see what automated scanning gets you.
  3. Try PyRIT for orchestrated multi-turn attacks.
  4. Walk through Gandalf (or any CTF in section 16) for hands-on prompt injection.
  5. For applied work, read NIST AI 600-1 and NIST AI 100-2 to align findings to risk language stakeholders understand.

For deeper workflow guidance, see TOOLING.md.


1. LLM Red Teaming & Scanning

Automated and semi-automated tooling for testing LLM systems.

  • PyRIT — Microsoft's Python Risk Identification Tool. Orchestrated, multi-turn red teaming for generative AI. Probably the most full-featured open-source framework right now.
  • garak — NVIDIA's LLM vulnerability scanner. Probe-and-detector architecture, dozens of attack categories (encoding tricks, DAN variants, prompt leak, toxicity, RealToxicityPrompts, etc.). Think nmap for LLMs.
  • promptfoo — LLM evaluation + red teaming. Strong YAML-driven test harness, OWASP LLM Top 10 preset built in.
  • DeepEval — pytest-style LLM evaluation framework with red team modules.
  • Inspect AI — UK AISI's evaluation framework. Increasingly the standard for safety evals and dangerous-capability testing.
  • Mantis (verify before use — research project, lightly maintained) — Trail of Bits' framework for LLM adversarial testing
  • TextAttack — Adversarial attacks on NLP models (still useful for classifier-style targets; less relevant to large autoregressive models).
  • Counterfit — Microsoft archived this in 2023 in favor of PyRIT. Don't start here.

2. Prompt Injection & Jailbreak Research

Tools, payload collections, and research benches for prompt injection and jailbreaking specifically.

  • Promptmap2 — Automated prompt injection scanner for LLM apps.
  • L1B3RT4S — Plinius the Liberator's jailbreak prompt collection. Updated frequently; widely used as a corpus.
  • Many-shot jailbreaking — Anthropic research on long-context jailbreaks
  • Crescendo — Microsoft Research multi-turn jailbreak technique
  • Skeleton Key — Microsoft-disclosed jailbreak class
  • JailbreakBench — standardized jailbreak benchmark
  • HarmBench — automated red teaming benchmark from CAIS
  • AdvBench — GCG-style suffix attacks; companion to the original Zou et al. "universal and transferable adversarial attacks" paper
  • BurpGPT — Burp Suite extension that integrates LLM analysis into web testing flows

3. Agentic AI & MCP Attack Surface

Tools and benches for agent systems and Model Context Protocol — the hot 2025–2026 attack surface.


4. RAG & Vector Store Attacks

  • PoisonedRAG — Knowledge corruption attacks against RAG pipelines
  • AgentPoison — Memory-poisoning attacks on agent RAG memory
  • ConfusedPilot — RAG-based Copilot attack class
  • See also: section 1 tools (garak, promptfoo) have RAG-specific probes/presets

5. Multimodal Attacks

  • MM-SafetyBench — Vision-language model safety benchmark
  • VLAttack — Visual-language adversarial attacks
  • Voice Jailbreak Attacks — Audio modality attacks
  • FigStep — Image-encoded prompt injection
  • Adversarial patch / image attacks — see ART in section 6 for the classical toolkit

6. Adversarial Machine Learning (Classical)

Pre-LLM-era adversarial example tooling. Still relevant for classifiers, vision, and recommender models.

  • Adversarial Robustness Toolbox (ART) — IBM/LF AI; the most comprehensive adversarial ML library. 39+ attacks, 29+ defenses.
  • Foolbox — Adversarial examples for PyTorch/TensorFlow/JAX models
  • CleverHans — Benchmarking adversarial defenses (note: moved to cleverhans-lab org)
  • SecML — ML security against adversarial / poisoning / evasion attacks

7. Model Extraction & Privacy Attacks


8. Data & Model Poisoning / Backdoors

  • BackdoorBench — Comprehensive backdoor attack/defense benchmark
  • TrojanZoo — Backdoor and adversarial robustness library
  • BadNets — Classic backdoor reference implementation
  • Snorkel (formerly recommended for "poisoning detection" — that framing was loose)snorkel.org; data labeling tool, not strictly a security tool

9. Supply Chain & Model File Scanning

The pickle problem hasn't gone away; safetensors has, but compromised models still ship.


10. Guardrails & Runtime Defenses

For builders, but red teamers should know what they're testing against.

  • LLM Guard — ProtectAI's input/output scanner toolkit (was laiyer-ai/llm-guard — repo moved). 15 input + 20 output scanners.
  • NeMo Guardrails — NVIDIA's programmable guardrails (Colang DSL)
  • Rebuff — Prompt injection detection (heuristics + LLM classifier + vector DB + canary tokens)
  • Guardrails AI — Validation layer for LLM I/O
  • Vigil — LLM prompt injection scanner
  • LangKit — WhyLabs' LLM telemetry / monitoring with safety signals
  • Lakera Guard (commercial, free tier) — hosted prompt injection / data loss API
  • Llama Guard 3 / 4 — Meta's safety classifier models (open weights)

11. Evaluation Harnesses & Benchmarks

Red teaming and eval converge. Use these to baseline a model and measure post-mitigation deltas.

  • lm-evaluation-harness — EleutherAI's standard eval harness (200+ tasks)
  • HELM — Stanford CRFM's holistic evaluation
  • OpenAI Evals — Eval framework + registry
  • Inspect AI — UK AISI; cross-listed with section 1
  • CyberSecEval — Meta's cyber-risk benchmark for LLMs (insecure code, cyberattack helpfulness, etc.)
  • METR Task Suite — autonomy / dangerous-capability evals
  • SWE-bench — code-agent benchmark (relevant for evaluating code-writing capability that matters in security contexts)

12. Bias, Fairness & Interpretability

  • AI Fairness 360 (AIF360) — IBM; 70+ fairness metrics, 10 bias mitigation algorithms
  • Fairlearn — Microsoft fairness assessment + mitigation
  • Captum — PyTorch model interpretability
  • SHAP — Shapley value explanations
  • LIME — Local interpretable model-agnostic explanations
  • Transformer Circuits — Anthropic's interpretability research (reference, not tooling)

13. MLOps / Deployment Security

  • MLflow — Experiment tracking; check security advisories, several CVEs over 2023–2024
  • Trivy — Container & filesystem scanner; works on AI containers too
  • Kubescape — Kubernetes hardening
  • NB Defense — Jupyter notebook security scanner
  • Morpheus — NVIDIA cybersecurity AI pipeline (anomaly detection, etc.) — note: defensive/SOC framing, not red team

14. Standards, Frameworks & Compliance

Cross-cutting frameworks

NIST

  • NIST AI RMF 1.0 (AI 100-1) — Core AI risk management framework
  • NIST AI 600-1 — Generative AI Profile of the AI RMF (July 2024). 12 risk categories, 200+ suggested actions.
  • NIST AI 100-2 E2025 — Adversarial Machine Learning: A Taxonomy and Terminology
  • NIST SP 800-218A — Secure Software Development Practices for Generative AI / Dual-Use Foundation Models
  • Dioptra — NIST's testbed for assessing ML attack effects

International / Regulatory

Vendor & Industry

Note on US executive actions: EO 14110 (Biden, 2023) was revoked in January 2025 and replaced by Executive Order 14179 ("Removing Barriers to American Leadership in Artificial Intelligence"). NIST AI 600-1 was published under EO 14110 but the document itself remains a NIST publication and is still in use. The US executive landscape is shifting; verify current EO and OMB guidance before relying on either.


15. Incident & Vulnerability Databases


16. Playgrounds & CTFs

For practice, training, and demonstrating attacks safely.


17. Bug Bounty & Disclosure Programs


18. Further Reading


Contributing

PRs welcome. Useful contributions:

  • New tools that meaningfully change a workflow (not just another wrapper)
  • Replacements for deprecated tools
  • Updates to standards / frameworks
  • Broken-link fixes (please verify before submitting)

Please keep entries tight, cite primary sources, and note if something is research-quality vs production-ready vs commercial.


License

MIT — use freely, attribution appreciated.


This is a living document. The AI security field moves quickly; expect quarterly-ish refreshes.

About

Curated LLM/AI attack tools — prompt injection, jailbreaks, agentic threats, adversarial ML, MCP attack surface

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors