AI Red Teaming Harness

A small, deliberately vulnerable AI-augmented support assistant. Companion code for chapter 6 of AI Red Teaming.

This is Glasswire Support — a fictional SaaS company's customer-support bot. It speaks to customers, looks up account info, searches an internal knowledge base, and can email summaries to support engineers. It is designed to fall over to every attack class the book covers: direct injection, indirect injection through retrieved documents and tool outputs, multi-turn manipulation, output exfiltration through markdown, and tool abuse.

It is not safe to expose to the internet. Run it locally. Do not point it at production data. Do not give it real credentials. The whole point is that it has none of the defenses you'd want in production.

What's in here

harness/        the application
  app.py        FastAPI server, chat endpoint, web UI
  agent.py      LLM orchestration: system prompt, tool loop
  tools.py      lookup_customer, search_kb, send_summary_email, fetch_url
  rag.py        the knowledge-base loader and retriever
  models.py     thin wrapper for Anthropic / OpenAI clients
kb/             markdown documents indexed for RAG
  *.md          fake KB articles, including some attacker-controlled ones
attacks/        runnable attack scripts that target a local instance
  ch03_*.py     direct prompt injection examples
  ch04_*.py     indirect injection via the KB
  ch05_*.py     multi-turn jailbreaks
  ch08_*.py     tool abuse
  ch09_*.py     exfiltration via markdown rendering

Running

You need Python 3.11+ and an API key for either Anthropic or OpenAI (or both — the model selection is in harness/models.py).

git clone https://github.com/cloudstreet-dev/AI-Red-Teaming-Harness
cd AI-Red-Teaming-Harness
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
export ANTHROPIC_API_KEY=sk-...     # or OPENAI_API_KEY
python -m harness.app

Open http://localhost:8000. Chat with Glasswire Support. Then run any attack script:

python attacks/ch03_system_prompt_extraction.py

Each script prints what it sent, what it got back, and whether the attack succeeded.

Configuration

Everything is in harness/config.py. The defaults are intentionally permissive — long context windows, free tool use, markdown rendering on. Tighten them, re-run the attacks, see what changes. That is the exercise.

License

CC0 1.0 Universal — public domain dedication. See LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AI Red Teaming Harness

What's in here

Running

Configuration

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
attacks		attacks
harness		harness
kb		kb
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

AI Red Teaming Harness

What's in here

Running

Configuration

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages