BehaviorProbe

Task-stratified behavioral regression testing for LLM versions.

BehaviorProbe tells you not just that your model drifted, but which task type regressed and in which direction (more hedged, more verbose, more clipped, etc.).

What Problem Does It Solve?

When you update an LLM (fine-tune, prompt engineer, upgrade versions), you need to know:

❌ Bad approach: "Did the model change?" (too vague)
✓ Good approach: "Did the model become less factual? More verbose? Less helpful on code tasks?"

BehaviorProbe answers these specific questions by:

Running 30 fixed test prompts across 5 task types
Comparing responses between two model versions
Measuring semantic drift (how different are the responses?)
Detecting direction (is it becoming more verbose, more hedged, etc.?)
Flagging regressions per task type

Key Features

5 task types — factual, instruction, safety, code, reasoning (30 prompts total)
Drift direction — know how the model changed (verbose, hedged, clipped, stable)
Zero infrastructure — runs locally, stores in SQLite, no servers
Multi-provider — Claude, GPT, Ollama, Gemini, Mistral, or your custom model
Real-time logging — see each prompt and response as it runs
Statistical analysis — confidence intervals, significance testing (roadmap)
Cross-provider comparison — Claude vs GPT, fine-tunes vs base models, etc.

Installation

Requirements

Python 3.9+
2GB RAM (for embeddings)
~5 minutes to download sentence-transformers model

Steps

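# Clone the repo (or download and unpack it), then enter the directory
git clone https://github.com/SadhanaSai/behaviorprobe.git
cd behaviorprobe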

# Install dependencies and package
pip install -e .

# Verify installation
behaviorprobe --version
behaviorprobe --help

Dependencies

sentence-transformers  # embeddings (2GB download on first run)
spacy                  # NLP
click                  # CLI
matplotlib             # charts
pandas                 # data handling
anthropic              # Claude API (optional)
openai                 # GPT API (optional)

For specific providers:

pip install -e ".[gemini]"       # Google Gemini
pip install -e ".[huggingface]"  # HuggingFace models

Quick Start

1. Run a local model (free, no API key)

# In another terminal, start Ollama
ollama serve

# Then run BehaviorProbe
behaviorprobe run --version qwen-1.5b \
                  --provider ollama \
                  --model qwen2.5:1.5b

Output:

Running qwen-1.5b (ollama qwen2.5:1.5b)...

[1/30] factual_1 (factual)
  Prompt: What is the capital of France?...
  Response: The capital of France is Paris...

[2/30] factual_2 (factual)
  ...

[OK] Stored in ~/.behaviorprobe/results.db

2. Compare two versions

# Run another model
behaviorprobe run --version llama-3.2 \
                  --provider ollama \
                  --model llama3.2

# Compare them
behaviorprobe compare qwen-1.5b llama-3.2 --out comparison.png

Output:

----------------------------------------------------------
  BehaviorProbe Scorecard  |  qwen-1.5b  >  llama-3.2
----------------------------------------------------------
  Task Type          Drift     Max  Flagged  Direction
----------------------------------------------------------
  factual           0.1113  0.1923        4  more_hedged
  instruction       0.1447  0.1929        5  less_hedged
  safety            0.7312  0.8440        6  more_verbose
  code              0.2006  0.3475        6  less_hedged
  reasoning         0.1669  0.2649        6  less_hedged
----------------------------------------------------------

Chart saved > comparison.png

3. View stored runs

# List all runs
behaviorprobe show

# Inspect responses
behaviorprobe inspect qwen-1.5b
behaviorprobe inspect qwen-1.5b --type safety

Use Cases

Fine-Tuning Validation

# Before fine-tune
behaviorprobe run --version base-model --provider anthropic

# After fine-tune
behaviorprobe run --version finetuned-model --provider custom \
                  --module ./my_finetuned_caller.py

# Check for regressions
behaviorprobe compare base-model finetuned-model

Model Version Upgrade

# Old model
behaviorprobe run --version claude-3-5 --provider anthropic \
                  --model claude-3-5-sonnet-20241022

# New model
behaviorprobe run --version claude-4 --provider anthropic \
                  --model claude-opus-4-20250514

# Compare
behaviorprobe compare claude-3-5 claude-4

Prompt Engineering Evaluation

# Without optimized prompts
behaviorprobe run --version prompts-v1 --provider custom \
                  --module ./base_prompts.py

# With optimized prompts
behaviorprobe run --version prompts-v2 --provider custom \
                  --module ./optimized_prompts.py

# See what changed
behaviorprobe compare prompts-v1 prompts-v2
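The contents of `base_prompts.py` and `optimized_prompts.py` are not shown above. As a rough sketch, assuming each module follows the `get_caller()` convention from the Custom Provider Template section, an "optimized" module might wrap every corpus prompt in a template before calling the model. `SYSTEM_PREFIX` and `call_base_model` are hypothetical names used only for illustration; swap the placeholder for your real model client.

```python
# optimized_prompts.py (hypothetical contents, for illustration only)

# Example prompt change under test; the wording here is an assumption.
SYSTEM_PREFIX = "Answer concisely and state any uncertainty explicitly.\n\n"


def call_base_model(prompt: str) -> str:
    """Placeholder for a real model call; replace with your own client code."""
    return f"[model output for: {prompt}]"


def get_caller():
    """Return a callable that applies the prompt template, then calls the model."""

    def call(prompt: str) -> str:
        # Prepend the template so every corpus prompt is wrapped the same way.
        return call_base_model(SYSTEM_PREFIX + prompt)

    return call
```

A matching `base_prompts.py` would be the same module with an empty `SYSTEM_PREFIX`, so that the only difference between the two runs is the prompt template.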

How Scoring Works

Drift Calculation

Metric: Cosine distance between sentence embeddings (0 = same direction; typically 0-1 for text embeddings)
Threshold: 0.08 (calibrated from Nicholson et al., arXiv 2601.19934)
Flagged: Any prompt with drift > 0.08
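A minimal sketch of the drift metric, assuming the two responses have already been embedded as plain lists of floats. This is not the project's `scorer.py`; the function names are illustrative, and in practice the vectors would come from a sentence-transformers model rather than being written by hand.

```python
import math

DRIFT_THRESHOLD = 0.08  # threshold quoted in this section


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def is_flagged(drift, threshold=DRIFT_THRESHOLD):
    """A prompt is flagged only when drift strictly exceeds the threshold."""
    return drift > threshold
```

Identical vectors give a distance of 0 and orthogonal vectors give 1, so a threshold of 0.08 flags fairly small semantic shifts.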

Direction Detection

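More verbose: Response is at least 15% longer than the baseline response
More hedged: Contains more hedge words (may, might, could, etc.) than the baseline response
More clipped: Response is at least 15% shorter than the baseline response
Less hedged: Contains fewer hedge words than the baseline response
Stable: No significant change in length or hedging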
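The rules above could be combined roughly as follows. This is a sketch under several assumptions that the README does not confirm: length is measured in whitespace-separated words, length changes take precedence over hedging changes, the hedge list extends "may, might, could" with a few illustrative words, and the labels `more_clipped` and `stable` are spelled like the ones shown in the scorecard.

```python
import re

# Illustrative hedge list; only "may", "might", "could" are named in the README.
HEDGE_WORDS = {"may", "might", "could", "possibly", "perhaps"}


def count_hedges(text):
    """Count occurrences of hedge words in a response (case-insensitive)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(1 for w in words if w in HEDGE_WORDS)


def classify_direction(old, new, length_ratio=0.15):
    """Label how `new` differs from `old` using the length and hedge rules."""
    old_len, new_len = len(old.split()), len(new.split())
    if old_len > 0:
        change = (new_len - old_len) / old_len
        if change >= length_ratio:
            return "more_verbose"
        if change <= -length_ratio:
            return "more_clipped"
    # Assumed precedence: hedging is only checked when length is roughly stable.
    old_hedges, new_hedges = count_hedges(old), count_hedges(new)
    if new_hedges > old_hedges:
        return "more_hedged"
    if new_hedges < old_hedges:
        return "less_hedged"
    return "stable"
```

Because each response receives exactly one label here, a response that is both longer and more hedged is reported as `more_verbose`; a different precedence would be an equally plausible design.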

Regression Status

OK: None of the 6 prompts in a task type exceeds the drift threshold
[!] REGRESSION: One or more prompts in the task type are flagged
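A sketch of how per-prompt drift scores might roll up into the per-task-type scorecard. It assumes the "Drift" column is the mean of the per-prompt scores, which the README does not state; `summarize` and the dictionary keys are hypothetical names.

```python
from collections import defaultdict
from statistics import mean


def summarize(drifts, threshold=0.08):
    """Aggregate (task_type, drift) pairs into one status row per task type."""
    by_type = defaultdict(list)
    for task_type, drift in drifts:
        by_type[task_type].append(drift)

    summary = {}
    for task_type, values in by_type.items():
        flagged = sum(1 for v in values if v > threshold)
        summary[task_type] = {
            "mean_drift": mean(values),  # assumed meaning of the "Drift" column
            "max_drift": max(values),
            "flagged": flagged,
            # A single flagged prompt is enough to mark the task type.
            "status": "REGRESSION" if flagged >= 1 else "OK",
        }
    return summary
```

With a rule this strict, one noisy prompt flips a whole task type to REGRESSION, which is worth keeping in mind when reading the Flagged column.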

CLI Commands

# Run corpus against a model
behaviorprobe run --version <tag> \
                  --provider <anthropic|openai|ollama|custom> \
                  [--model <model-string>] \
                  [--module <path-to-caller.py>] \
                  [--corpus <path-to-corpus.json>] \
                  [--overwrite]

# Compare two versions
behaviorprobe compare <version_a> <version_b> \
                      [--out <chart.png>]

# List all stored runs
behaviorprobe show

# View raw responses
behaviorprobe inspect <version> \
                      [--type factual|instruction|safety|code|reasoning] \
                      [--truncate <chars>]

Providers

Built-in Providers

Provider	Auth	Cost	Setup	Speed
Ollama	None	Free	`ollama serve`	~20s/run
Anthropic	API key	$0.01-0.10/run	`export ANTHROPIC_API_KEY=...`	~5s/run
OpenAI	API key	$0.01-0.05/run	`export OPENAI_API_KEY=...`	~5s/run
Custom	Your code	Varies	Python module + `get_caller()`	Varies

Custom Provider Template

# my_model_caller.py
def get_caller():
    """Return a function that calls your model."""

    # Setup (runs once). load_my_model() is a placeholder:
    # replace it with your own model-loading code.
    my_model = load_my_model()

    def call(prompt: str) -> str:
        """Call the model and return its response as a string."""
        return my_model.generate(prompt)

    return call

Then run:

behaviorprobe run --version my-model --provider custom \
                  --module ./my_model_caller.py

Project Structure

behaviorprobe/
├── src/promptdrift/          # Core package
│   ├── cli.py                # 4 CLI commands
│   ├── corpus.py             # Load 30 test prompts
│   ├── runner.py             # Execute models + store results
│   ├── scorer.py             # Calculate drift + direction
│   └── reporter.py           # Generate scorecards + charts
├── prompts/corpus.json       # 30 test prompts (5 types × 6)
├── examples/
│   └── custom_callers/       # Provider integration templates
├── tools/                    # Utility scripts
│   ├── query_results.py      # Database explorer
│   ├── compare_responses.py  # Side-by-side diff
│   └── setup_corpus.py       # Initialize corpus
└── CLAUDE.md                 # Full setup guide

Documentation

Document	Purpose
CLAUDE.md	Complete setup, usage guide, troubleshooting
STRUCTURE.md	Directory organization and file purposes
ROADMAP.md	Future features, expansion scope, phases
CODE_IMPROVEMENTS.md	Code quality changes, PEP8 compliance
COMPLETION_SUMMARY.md	Project overview and status

Data Storage

Results are stored in ~/.behaviorprobe/results.db (SQLite):

-- All runs with metadata
SELECT version, model, provider, timestamp FROM runs;

-- All responses for a version
SELECT prompt_id, task_type, prompt_text, response 
FROM results WHERE run_id = 1;

No data leaves your machine. Everything is local.
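The same database can be read from Python with the standard-library `sqlite3` module. The sketch below assumes only the `results` table and the `run_id` and `task_type` columns that appear in the queries above; the full schema is not documented here, and the helper names are hypothetical.

```python
import sqlite3
from collections import Counter
from pathlib import Path

# Default location given in this section.
DB_PATH = Path.home() / ".behaviorprobe" / "results.db"


def open_results_db(path=DB_PATH):
    """Open the results database; the caller is responsible for closing it."""
    return sqlite3.connect(path)


def responses_per_task_type(conn, run_id):
    """Count stored responses per task type for one run."""
    rows = conn.execute(
        "SELECT task_type FROM results WHERE run_id = ?", (run_id,)
    ).fetchall()
    return Counter(task_type for (task_type,) in rows)
```

For a complete run of the default corpus, each of the 5 task types should come back with a count of 6, which makes this a quick check that a run finished.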

Citation

BehaviorProbe builds on RETAIN (Dixit et al., EMNLP 2024) with task taxonomy stratification, drift direction detection, and zero-infrastructure design.

If you use BehaviorProbe in research, cite it as:

@software{behaviorprobe2026,
  title={BehaviorProbe: Task-Stratified Behavioral Regression Testing for LLM Versions},
  author={Sadhana Sainarayanan},
  year={2026},
  url={https://github.com/SadhanaSai/behaviorprobe}
}

Acknowledge the foundational RETAIN work:

This work builds on RETAIN (Dixit et al., EMNLP 2024), a behavioral regression testing framework for LLMs.

Troubleshooting

Issue	Solution
`behaviorprobe: command not found`	Run `pip install -e .` again
`ollama: command not found`	Install from ollama.com, run `ollama serve`
`Version already has stored runs`	Use `--overwrite` flag or different version tag
`Custom caller failed`	Check API key env var is set, verify `get_caller()` function exists
Unicode errors on Windows	Already handled; upgrade to the latest version

Performance

On a typical machine (2024 MacBook Pro / Windows PC):

Per-run time: 15-30 seconds (30 prompts)
- 5s: API calls to model
- 5s: Embedding generation
- 5s: Database operations
Disk space: ~100KB per version stored
Memory: ~500MB (for embeddings model)

Contributing

Contributions welcome! Areas of focus:

Testing — Unit tests, integration tests
Scoring — Statistical significance, better metrics
Visualization — Enhanced charts, reports
Corpus — Custom task types, domain-specific prompts
Providers — More LLM APIs

See ROADMAP.md for planned features.

License

MIT License - see LICENSE for details.

Limitations & Trade-offs

Sample size: 30 prompts (6 per task type) is fast but underpowered for statistical claims. Good for regression detection and relative comparison; expand to 50-75 for research publication. See KNOWN_LIMITATIONS.md for details.
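To see why 6 prompts per task type is underpowered, a percentile bootstrap over one task type's drift scores gives a feel for how wide the uncertainty on the mean is. This is an illustrative sketch, not a feature of BehaviorProbe; `bootstrap_ci` is a hypothetical helper and any scores passed to it are your own.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so repeated calls agree
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Calling `bootstrap_ci` on the six drift scores of one task type returns a `(lower, upper)` pair; if that interval straddles the 0.08 threshold, the scorecard entry for that task type is best read as suggestive rather than conclusive.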

Drift metric: Cosine distance catches semantic drift but misses factual errors. See KNOWN_LIMITATIONS.md for workarounds and planned metrics.

Quick Links

Setup: CLAUDE.md
Features to Build: ROADMAP.md
Limitations & Trade-offs: KNOWN_LIMITATIONS.md
Code Quality: CODE_IMPROVEMENTS.md

Questions?

How do I use it? → CLAUDE.md
What should I build next? → ROADMAP.md
How does it work? → See "How Scoring Works" above
Is it production-ready? → Yes for research and testing; monitoring is on the roadmap
