Task-stratified behavioral regression testing for LLM versions.
Tells you not just that your model drifted, but which task type regressed and in which direction (more hedged, more verbose, more clipped, etc.).
When you update an LLM (fine-tuning, prompt engineering, a version upgrade), you need to ask the right question:
❌ Bad approach: "Did the model change?" (too vague)
✓ Good approach: "Did the model become less factual? More verbose? Less helpful on code tasks?"
BehaviorProbe answers these specific questions by:
- Running 30 fixed test prompts across 5 task types
- Comparing responses between two model versions
- Measuring semantic drift (how different are the responses?)
- Detecting direction (is it becoming more verbose, more hedged, etc?)
- Flagging regressions per task type
- 5 task types — factual, instruction, safety, code, reasoning (30 prompts total)
- Drift direction — know how the model changed (verbose, hedged, clipped, stable)
- Zero infrastructure — runs locally, stores in SQLite, no servers
- Multi-provider — Claude, GPT, Ollama, Gemini, Mistral, or your custom model
- Real-time logging — see each prompt and response as it runs
- Statistical analysis — confidence intervals, significance testing (roadmap)
- Cross-provider comparison — Claude vs GPT, fine-tunes vs base models, etc.
- Python 3.9+
- 2GB RAM (for embeddings)
- ~5 minutes to download sentence-transformers model
# Clone or download the repo
cd behaviorprobe
# Install dependencies and package
pip install -e .
# Verify installation
behaviorprobe --version
behaviorprobe --help

sentence-transformers # embeddings (2GB download on first run)
spacy # NLP
click # CLI
matplotlib # charts
pandas # data handling
anthropic # Claude API (optional)
openai # GPT API (optional)
For specific providers:
pip install -e ".[gemini]" # Google Gemini
pip install -e ".[huggingface]" # HuggingFace models

# In another terminal, start Ollama
ollama serve
# Then run BehaviorProbe
behaviorprobe run --version qwen-1.5b \
--provider ollama \
--model qwen2.5:1.5b

Output:
Running qwen-1.5b (ollama qwen2.5:1.5b)...
[1/30] factual_1 (factual)
Prompt: What is the capital of France?...
Response: The capital of France is Paris...
[2/30] factual_2 (factual)
...
[OK] Stored in ~/.behaviorprobe/results.db
# Run another model
behaviorprobe run --version llama-3.2 \
--provider ollama \
--model llama3.2
# Compare them
behaviorprobe compare qwen-1.5b llama-3.2 --out comparison.png

Output:
----------------------------------------------------------
BehaviorProbe Scorecard | qwen-1.5b > llama-3.2
----------------------------------------------------------
Task Type Drift Max Flagged Direction
----------------------------------------------------------
factual 0.1113 0.1923 4 more_hedged
instruction 0.1447 0.1929 5 less_hedged
safety 0.7312 0.8440 6 more_verbose
code 0.2006 0.3475 6 less_hedged
reasoning 0.1669 0.2649 6 less_hedged
----------------------------------------------------------
Chart saved > comparison.png
# List all runs
behaviorprobe show
# Inspect responses
behaviorprobe inspect qwen-1.5b
behaviorprobe inspect qwen-1.5b --type safety

# Before fine-tune
behaviorprobe run --version base-model --provider anthropic
# After fine-tune
behaviorprobe run --version finetuned-model --provider custom \
--module ./my_finetuned_caller.py
# Check for regressions
behaviorprobe compare base-model finetuned-model

# Old model
behaviorprobe run --version claude-3-5 --provider anthropic \
--model claude-3-5-sonnet-20241022
# New model
behaviorprobe run --version claude-4 --provider anthropic \
--model claude-opus-4-1-20250805
# Compare
behaviorprobe compare claude-3-5 claude-4

# Without optimized prompts
behaviorprobe run --version prompts-v1 --provider custom \
--module ./base_prompts.py
# With optimized prompts
behaviorprobe run --version prompts-v2 --provider custom \
--module ./optimized_prompts.py
# See what changed
behaviorprobe compare prompts-v1 prompts-v2

- Metric: Cosine distance between sentence embeddings (0-1)
- Threshold: 0.08 (calibrated from Nicholson et al., arXiv 2601.19934)
- Flagged: Any prompt with drift > 0.08
- More verbose: Response is 15%+ longer
- More hedged: Contains more hedge words (may, might, could, etc.)
- More clipped: Response is 15%+ shorter
- Less hedged: Fewer hedge words
- Stable: No significant change
- OK: None of the 6 prompts exceed drift threshold
- [!] REGRESSION: 1 or more prompts flagged
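
The scoring rules above can be sketched in a few small functions. This is an illustrative sketch, not BehaviorProbe's actual scorer: it assumes embeddings arrive as plain lists of floats, measures response length in words, checks length before hedging, and uses only the three hedge words named above. All function and constant names are hypothetical.

```python
import math
import re

DRIFT_THRESHOLD = 0.08   # per-prompt flag threshold quoted above
LENGTH_CHANGE = 0.15     # 15% length change quoted above
HEDGE_WORDS = {"may", "might", "could"}  # assumption: the real list is longer


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def hedge_count(text):
    """Count hedge words in a response (case-insensitive, whole words)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(word in HEDGE_WORDS for word in words)


def drift_direction(old, new):
    """Classify how `new` differs from `old`.

    Assumption: a length change takes priority over a hedging change.
    """
    old_len, new_len = len(old.split()), len(new.split())
    if old_len and new_len >= old_len * (1 + LENGTH_CHANGE):
        return "more_verbose"
    if old_len and new_len <= old_len * (1 - LENGTH_CHANGE):
        return "more_clipped"
    old_hedges, new_hedges = hedge_count(old), hedge_count(new)
    if new_hedges > old_hedges:
        return "more_hedged"
    if new_hedges < old_hedges:
        return "less_hedged"
    return "stable"


def task_status(drifts):
    """Mark a task type as a regression if any prompt exceeds the threshold."""
    flagged = sum(d > DRIFT_THRESHOLD for d in drifts)
    return {"flagged": flagged, "status": "REGRESSION" if flagged else "OK"}
```

In a real run the vectors would come from a sentence-embedding model rather than hand-written lists, and `task_status` would be called once per task type with that type's six per-prompt drift values.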
# Run corpus against a model
behaviorprobe run --version <tag> \
--provider <anthropic|openai|ollama|custom> \
[--model <model-string>] \
[--module <path-to-caller.py>] \
[--corpus <path-to-corpus.json>] \
[--overwrite]
# Compare two versions
behaviorprobe compare <version_a> <version_b> \
[--out <chart.png>]
# List all stored runs
behaviorprobe show
# View raw responses
behaviorprobe inspect <version> \
[--type factual|instruction|safety|code|reasoning] \
[--truncate <chars>]

| Provider | Auth | Cost | Setup | Speed |
|---|---|---|---|---|
| Ollama | None | Free | ollama serve | ~20s/run |
| Anthropic | API key | $0.01-0.10/run | export ANTHROPIC_API_KEY=... | ~5s/run |
| OpenAI | API key | $0.01-0.05/run | export OPENAI_API_KEY=... | ~5s/run |
| Custom | Your code | Varies | Python module + get_caller() | Varies |
# my_model_caller.py
def get_caller():
    """Return a function that calls your model."""
    # Setup (runs once). load_my_model() is a placeholder: replace it
    # with whatever loads or connects to your own model.
    my_model = load_my_model()

    def call(prompt: str) -> str:
        """Call the model and return response."""
        return my_model.generate(prompt)

    return call

Then run:
behaviorprobe run --version my-model --provider custom \
--module ./my_model_caller.py

behaviorprobe/
├── src/promptdrift/ # Core package
│ ├── cli.py # 4 CLI commands
│ ├── corpus.py # Load 30 test prompts
│ ├── runner.py # Execute models + store results
│ ├── scorer.py # Calculate drift + direction
│ └── reporter.py # Generate scorecards + charts
├── prompts/corpus.json # 30 test prompts (5 types × 6)
├── examples/
│ └── custom_callers/ # Provider integration templates
├── tools/ # Utility scripts
│ ├── query_results.py # Database explorer
│ ├── compare_responses.py # Side-by-side diff
│ └── setup_corpus.py # Initialize corpus
└── CLAUDE.md # Full setup guide
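
The `get_caller()` contract used by the custom provider also fits the prompt-comparison workflow, where two modules differ only in how they wrap the prompt. The sketch below shows one way such a module might look. The file name echoes the example commands, but the prefix text, `wrap_with_prefix`, and the `_echo_model` stand-in are all hypothetical; a real module would call your actual model client instead.

```python
# optimized_prompts.py (illustrative; contents are assumptions, not shipped code)
from typing import Callable

# Hypothetical instruction prepended to every corpus prompt.
SYSTEM_PREFIX = "Answer concisely and state any uncertainty explicitly.\n\n"


def wrap_with_prefix(base_call: Callable[[str], str], prefix: str) -> Callable[[str], str]:
    """Return a caller that prepends `prefix` before delegating to `base_call`."""

    def call(prompt: str) -> str:
        return base_call(prefix + prompt)

    return call


def _echo_model(prompt: str) -> str:
    """Stand-in for a real model call so the module runs without any API."""
    return f"echo: {prompt}"


def get_caller() -> Callable[[str], str]:
    """Entry point expected by --provider custom: returns prompt -> response."""
    return wrap_with_prefix(_echo_model, SYSTEM_PREFIX)
```

A matching baseline module would return the unwrapped caller, so that any drift reported by `behaviorprobe compare` can be attributed to the prefix alone.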
| Document | Purpose |
|---|---|
| CLAUDE.md | Complete setup, usage guide, troubleshooting |
| STRUCTURE.md | Directory organization and file purposes |
| ROADMAP.md | Future features, expansion scope, phases |
| CODE_IMPROVEMENTS.md | Code quality changes, PEP8 compliance |
| COMPLETION_SUMMARY.md | Project overview and status |
Results stored in ~/.behaviorprobe/results.db (SQLite):
-- All runs with metadata
SELECT version, model, provider, timestamp FROM runs;
-- All responses for a version
SELECT prompt_id, task_type, prompt_text, response
FROM results WHERE run_id = 1;

No data leaves your machine. Everything is local.
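
The same tables can be read from Python with the standard-library `sqlite3` module. The sketch below is a minimal example under stated assumptions: the column names come from the queries above, but the join condition (that `results.run_id` refers to the `rowid` of a row in `runs`) is a guess about the schema, and `responses_for_version` is a hypothetical helper.

```python
import sqlite3
from pathlib import Path

# Database location given above.
DB_PATH = Path.home() / ".behaviorprobe" / "results.db"


def responses_for_version(conn, version):
    """Return (prompt_id, task_type, response) rows for one version tag."""
    query = """
        SELECT r.prompt_id, r.task_type, r.response
        FROM results AS r
        JOIN runs ON runs.rowid = r.run_id  -- assumption about the schema
        WHERE runs.version = ?
        ORDER BY r.prompt_id
    """
    return conn.execute(query, (version,)).fetchall()


def print_version(version):
    """Print a short preview of every stored response for `version`."""
    conn = sqlite3.connect(DB_PATH)
    try:
        for prompt_id, task_type, response in responses_for_version(conn, version):
            print(prompt_id, task_type, response[:60])
    finally:
        conn.close()
```

Calling `print_version("qwen-1.5b")` after a run would list that version's responses. The query uses a bound parameter rather than string formatting, so version tags containing quotes are handled safely.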
BehaviorProbe builds on RETAIN (Dixit et al., EMNLP 2024) with task taxonomy stratification, drift direction detection, and zero-infrastructure design.
If you use BehaviorProbe in research, cite it as:
@software{behaviorprobe2026,
title={BehaviorProbe: Task-Stratified Behavioral Regression Testing for LLM Versions},
author={Sadhana Sainarayanan},
year={2026},
url={https://github.com/SadhanaSai/behaviorprobe}
}

Acknowledge the foundational RETAIN work:
This work builds on RETAIN (Dixit et al., EMNLP 2024), a behavioral regression testing framework for LLMs.
| Issue | Solution |
|---|---|
| behaviorprobe: command not found | Run pip install -e . again |
| ollama: command not found | Install from ollama.com, run ollama serve |
| Version already has stored runs | Use --overwrite flag or different version tag |
| Custom caller failed | Check API key env var is set, verify get_caller() function exists |
| Unicode errors on Windows | Already handled, use latest version |
On a typical machine (2024 MacBook Pro / Windows PC):
- Per-run time: 15-30 seconds (30 prompts)
  - 5s: API calls to model
  - 5s: Embedding generation
  - 5s: Database operations
- Disk space: ~100KB per version stored
- Memory: ~500MB (for embeddings model)
Contributions welcome! Areas of focus:
- Testing — Unit tests, integration tests
- Scoring — Statistical significance, better metrics
- Visualization — Enhanced charts, reports
- Corpus — Custom task types, domain-specific prompts
- Providers — More LLM APIs
See ROADMAP.md for planned features.
MIT License - see LICENSE for details.
Sample size: 30 prompts (6 per task type) is fast but underpowered for statistical claims. Good for regression detection and relative comparison; expand to 50-75 for research publication. See KNOWN_LIMITATIONS.md for details.
Drift metric: Cosine distance catches semantic drift but misses factual errors. See KNOWN_LIMITATIONS.md for workarounds and planned metrics.
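
To see why six prompts per task type limit statistical claims, one option is a percentile bootstrap over the per-prompt drift scores. The sketch below is not part of BehaviorProbe (statistical analysis is listed as a roadmap item); the drift values are made up for illustration, and `bootstrap_ci` is a hypothetical helper.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    means = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-prompt drift scores for one task type (6 prompts).
drifts = [0.05, 0.07, 0.09, 0.11, 0.19, 0.16]
lower, upper = bootstrap_ci(drifts)
print(f"mean drift {mean(drifts):.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

With only six values the interval tends to span a large share of the observed range, which is one way to read the caution above: a larger corpus narrows the interval, while the per-prompt flags remain useful for relative comparison.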
- Setup: CLAUDE.md
- Features to Build: ROADMAP.md
- Limitations & Trade-offs: KNOWN_LIMITATIONS.md
- Code Quality: CODE_IMPROVEMENTS.md
- How do I use it? → CLAUDE.md
- What should I build next? → ROADMAP.md
- How does it work? → See "How Scoring Works" above
- Is it production-ready? → Yes for research and testing; continuous monitoring is on the roadmap