BehaviorProbe

Task-stratified behavioral regression testing for LLM versions.

Tells you not just that your model drifted, but which task type regressed and in which direction (more hedged, more verbose, more clipped, etc.).

Python 3.9+ License: MIT


What Problem Does It Solve?

When you update an LLM (fine-tune, prompt engineer, upgrade versions), you need to know:

✗ Bad approach: "Did the model change?" (too vague)

✓ Good approach: "Did the model become less factual? More verbose? Less helpful on code tasks?"

BehaviorProbe answers these specific questions by:

  1. Running 30 fixed test prompts across 5 task types
  2. Comparing responses between two model versions
  3. Measuring semantic drift (how different are the responses?)
  4. Detecting direction (is it becoming more verbose, more hedged, etc.?)
  5. Flagging regressions per task type

Key Features

  • 5 task types — factual, instruction, safety, code, reasoning (30 prompts total)
  • Drift direction — know how the model changed (verbose, hedged, clipped, stable)
  • Zero infrastructure — runs locally, stores in SQLite, no servers
  • Multi-provider — Claude, GPT, Ollama, Gemini, Mistral, or your custom model
  • Real-time logging — see each prompt and response as it runs
  • Statistical analysis — confidence intervals, significance testing (roadmap)
  • Cross-provider comparison — Claude vs GPT, fine-tunes vs base models, etc.

Installation

Requirements

  • Python 3.9+
  • 2GB RAM (for embeddings)
  • ~5 minutes to download sentence-transformers model

Steps

# Clone the repo (or download and unpack it), then enter it
git clone https://github.com/SadhanaSai/behaviorprobe.git
cd behaviorprobe

# Install dependencies and package
pip install -e .

# Verify installation
behaviorprobe --version
behaviorprobe --help

Dependencies

sentence-transformers  # embeddings (2GB download on first run)
spacy                  # NLP
click                  # CLI
matplotlib             # charts
pandas                 # data handling
anthropic              # Claude API (optional)
openai                 # GPT API (optional)

For specific providers:

pip install -e ".[gemini]"       # Google Gemini
pip install -e ".[huggingface]"  # HuggingFace models

Quick Start

1. Run a local model (free, no API key)

# In another terminal, start Ollama
ollama serve

# Then run BehaviorProbe
behaviorprobe run --version qwen-1.5b \
                  --provider ollama \
                  --model qwen2.5:1.5b

Output:

Running qwen-1.5b (ollama qwen2.5:1.5b)...

[1/30] factual_1 (factual)
  Prompt: What is the capital of France?...
  Response: The capital of France is Paris...

[2/30] factual_2 (factual)
  ...

[OK] Stored in ~/.behaviorprobe/results.db

2. Compare two versions

# Run another model
behaviorprobe run --version llama-3.2 \
                  --provider ollama \
                  --model llama3.2

# Compare them
behaviorprobe compare qwen-1.5b llama-3.2 --out comparison.png

Output:

----------------------------------------------------------
  BehaviorProbe Scorecard  |  qwen-1.5b  >  llama-3.2
----------------------------------------------------------
  Task Type          Drift     Max  Flagged  Direction
----------------------------------------------------------
  factual           0.1113  0.1923        4  more_hedged
  instruction       0.1447  0.1929        5  less_hedged
  safety            0.7312  0.8440        6  more_verbose
  code              0.2006  0.3475        6  less_hedged
  reasoning         0.1669  0.2649        6  less_hedged
----------------------------------------------------------

Chart saved > comparison.png

3. View stored runs

# List all runs
behaviorprobe show

# Inspect responses
behaviorprobe inspect qwen-1.5b
behaviorprobe inspect qwen-1.5b --type safety

Use Cases

Fine-Tuning Validation

# Before fine-tune
behaviorprobe run --version base-model --provider anthropic

# After fine-tune
behaviorprobe run --version finetuned-model --provider custom \
                  --module ./my_finetuned_caller.py

# Check for regressions
behaviorprobe compare base-model finetuned-model

Model Version Upgrade

# Old model
behaviorprobe run --version claude-3-5 --provider anthropic \
                  --model claude-3-5-sonnet-20241022

# New model
behaviorprobe run --version claude-4 --provider anthropic \
                  --model claude-opus-4-1-20250805

# Compare
behaviorprobe compare claude-3-5 claude-4

Prompt Engineering Evaluation

# Without optimized prompts
behaviorprobe run --version prompts-v1 --provider custom \
                  --module ./base_prompts.py

# With optimized prompts
behaviorprobe run --version prompts-v2 --provider custom \
                  --module ./optimized_prompts.py

# See what changed
behaviorprobe compare prompts-v1 prompts-v2
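
The two modules in this workflow are ordinary custom callers that differ only in how they wrap the prompt. A minimal sketch of what `optimized_prompts.py` might contain, assuming the `get_caller()` contract described under Custom Provider Template; `SYSTEM_PREFIX` and `_call_backend` are hypothetical names, and the backend is a stub to replace with a real model call:

```python
# optimized_prompts.py (hypothetical contents, not shipped with the repo)
from typing import Callable

# Assumed "optimized" instruction prepended to every corpus prompt.
SYSTEM_PREFIX = "Answer concisely. State uncertainty only when it matters.\n\n"


def _call_backend(full_prompt: str) -> str:
    """Stub backend: replace with a real call to your model or API."""
    return f"[stub response to {len(full_prompt)} chars]"


def get_caller() -> Callable[[str], str]:
    """Return a function that sends the wrapped prompt to the backend."""

    def call(prompt: str) -> str:
        return _call_backend(SYSTEM_PREFIX + prompt)

    return call


caller = get_caller()
reply = caller("What is the capital of France?")
print(reply)
```

A `base_prompts.py` counterpart would be the same module with an empty prefix, so that any drift between the two runs comes from the prompt wrapper alone.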

How Scoring Works

Drift Calculation
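
The Limitations section names cosine distance as the drift metric. A minimal sketch of that calculation, assuming each response has already been embedded as a vector (the package lists sentence-transformers for embeddings). The toy vectors and the `cosine_distance` helper are illustrative and are not the package's own code:

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("a zero vector has no direction")
    return 1.0 - dot / (norm_a * norm_b)


# Toy embeddings standing in for two versions' responses to the same prompt.
old_vec = [1.0, 0.0, 1.0]
new_vec = [1.0, 1.0, 0.0]
drift = cosine_distance(old_vec, new_vec)
print(f"drift = {drift:.4f}")
```

Under this reading, the Drift column in the scorecard would be the mean of these per-prompt distances within a task type and Max the largest one, though that aggregation is an assumption.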

Direction Detection

  • More verbose: New response is 15%+ longer than the baseline response
  • More hedged: New response contains more hedge words (may, might, could, etc.)
  • More clipped: New response is 15%+ shorter than the baseline response
  • Less hedged: New response contains fewer hedge words
  • Stable: No significant change
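
The rules above can be sketched as a small classifier. This is an illustration under stated assumptions, not the code in scorer.py: length is measured in whitespace-separated words, the hedge list contains only the three words named above, length changes take precedence over hedging, and the `more_clipped` label is inferred by analogy with the labels shown in the scorecard.

```python
import re

HEDGE_WORDS = {"may", "might", "could"}  # assumed subset; the full list is not given
LENGTH_THRESHOLD = 0.15  # the 15% figure from the rules above


def count_hedges(text: str) -> int:
    """Count hedge-word occurrences, case-insensitively."""
    return sum(1 for word in re.findall(r"[a-z']+", text.lower()) if word in HEDGE_WORDS)


def detect_direction(old: str, new: str) -> str:
    """Classify how `new` differs from `old` (assumed precedence: length first)."""
    old_len, new_len = len(old.split()), len(new.split())
    if old_len > 0:
        change = (new_len - old_len) / old_len
        if change >= LENGTH_THRESHOLD:
            return "more_verbose"
        if change <= -LENGTH_THRESHOLD:
            return "more_clipped"
    old_hedges, new_hedges = count_hedges(old), count_hedges(new)
    if new_hedges > old_hedges:
        return "more_hedged"
    if new_hedges < old_hedges:
        return "less_hedged"
    return "stable"


direction = detect_direction("The answer is clearly four.", "The answer might be four.")
print(direction)
```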

Regression Status

  • OK: None of the 6 prompts in the task type exceeds the drift threshold
  • [!] REGRESSION: 1 or more prompts in the task type are flagged
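
A minimal sketch of the per-task-type flagging rule, assuming per-prompt drift scores are already available. The threshold value is hypothetical (this README does not state the real one), and `regression_status` is an illustrative helper rather than part of the package:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

DRIFT_THRESHOLD = 0.10  # hypothetical value; the real threshold is not stated here


def regression_status(
    scores: Iterable[Tuple[str, float]],
    threshold: float = DRIFT_THRESHOLD,
) -> Dict[str, Tuple[int, str]]:
    """Map each task type to (number of flagged prompts, status label)."""
    flagged: Dict[str, int] = defaultdict(int)
    for task_type, drift in scores:
        # Adding 0 still registers the task type, so clean types report "OK".
        flagged[task_type] += 1 if drift > threshold else 0
    return {
        task: (count, "[!] REGRESSION" if count >= 1 else "OK")
        for task, count in flagged.items()
    }


# Illustrative (task_type, drift) pairs, not real results.
scores = [("factual", 0.02), ("factual", 0.05), ("safety", 0.73), ("safety", 0.04)]
status = regression_status(scores)
print(status)
```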

CLI Commands

# Run corpus against a model
behaviorprobe run --version <tag> \
                  --provider <anthropic|openai|ollama|custom> \
                  [--model <model-string>] \
                  [--module <path-to-caller.py>] \
                  [--corpus <path-to-corpus.json>] \
                  [--overwrite]

# Compare two versions
behaviorprobe compare <version_a> <version_b> \
                      [--out <chart.png>]

# List all stored runs
behaviorprobe show

# View raw responses
behaviorprobe inspect <version> \
                      [--type factual|instruction|safety|code|reasoning] \
                      [--truncate <chars>]

Providers

Built-in Providers

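| Provider  | Auth      | Cost           | Setup                        | Speed    |
|-----------|-----------|----------------|------------------------------|----------|
| Ollama    | None      | Free           | ollama serve                 | ~20s/run |
| Anthropic | API key   | $0.01-0.10/run | export ANTHROPIC_API_KEY=... | ~5s/run  |
| OpenAI    | API key   | $0.01-0.05/run | export OPENAI_API_KEY=...    | ~5s/run  |
| Custom    | Your code | Varies         | Python module + get_caller() | Varies   |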

Custom Provider Template

# my_model_caller.py
def get_caller():
    """Return a function that calls your model."""

    # Setup (runs once). load_my_model() is a placeholder: replace it
    # with whatever loads or connects to your model.
    my_model = load_my_model()

    def call(prompt: str) -> str:
        """Call the model and return its response as a string."""
        return my_model.generate(prompt)

    return call

Then run:

behaviorprobe run --version my-model --provider custom \
                  --module ./my_model_caller.py
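
Before spending a full 30-prompt run on a new caller, it can help to smoke-test the module on its own. The sketch below loads a module from a file path with the standard library and checks the `get_caller()` contract. `load_caller` and `smoke_test` are hypothetical helpers, and the way BehaviorProbe itself imports the module may differ:

```python
import importlib.util
import tempfile
from pathlib import Path
from typing import Callable


def load_caller(module_path: str) -> Callable[[str], str]:
    """Import a caller module from a file path and return its callable."""
    path = Path(module_path)
    spec = importlib.util.spec_from_file_location(path.stem, str(path))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    if not hasattr(module, "get_caller"):
        raise AttributeError(f"{path} does not define get_caller()")
    caller = module.get_caller()
    if not callable(caller):
        raise TypeError("get_caller() must return a callable")
    return caller


def smoke_test(module_path: str, prompt: str = "What is the capital of France?") -> str:
    """Send one prompt through the caller and check that a string comes back."""
    response = load_caller(module_path)(prompt)
    if not isinstance(response, str):
        raise TypeError("caller must return a string")
    return response


# Demo with a throwaway echo caller written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "echo_caller.py"
    demo.write_text("def get_caller():\n    return lambda prompt: 'echo: ' + prompt\n")
    result = smoke_test(str(demo), prompt="ping")
print(result)
```

Pointing `smoke_test` at `./my_model_caller.py` exercises the same two failure modes listed under Troubleshooting: a missing `get_caller()` function and a caller that fails at call time.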

Project Structure

behaviorprobe/
├── src/promptdrift/          # Core package
│   ├── cli.py                # 4 CLI commands
│   ├── corpus.py             # Load 30 test prompts
│   ├── runner.py             # Execute models + store results
│   ├── scorer.py             # Calculate drift + direction
│   └── reporter.py           # Generate scorecards + charts
├── prompts/corpus.json       # 30 test prompts (5 types × 6)
├── examples/
│   └── custom_callers/       # Provider integration templates
├── tools/                    # Utility scripts
│   ├── query_results.py      # Database explorer
│   ├── compare_responses.py  # Side-by-side diff
│   └── setup_corpus.py       # Initialize corpus
└── CLAUDE.md                 # Full setup guide

Documentation

| Document              | Purpose                                      |
|-----------------------|----------------------------------------------|
| CLAUDE.md             | Complete setup, usage guide, troubleshooting |
| STRUCTURE.md          | Directory organization and file purposes     |
| ROADMAP.md            | Future features, expansion scope, phases     |
| CODE_IMPROVEMENTS.md  | Code quality changes, PEP8 compliance        |
| COMPLETION_SUMMARY.md | Project overview and status                  |

Data Storage

Results stored in ~/.behaviorprobe/results.db (SQLite):

-- All runs with metadata
SELECT version, model, provider, timestamp FROM runs;

-- All responses for a version
SELECT prompt_id, task_type, prompt_text, response 
FROM results WHERE run_id = 1;

No data leaves your machine. Everything is local.
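
The same data can be read from Python with the standard `sqlite3` module. The sketch below assumes the columns shown in the queries above, plus a `runs.id` primary key that `results.run_id` refers to; that join column is an assumption. It runs against an in-memory database with made-up rows so it works without a real results file:

```python
import sqlite3
from pathlib import Path
from typing import List, Optional, Tuple

DB_PATH = Path.home() / ".behaviorprobe" / "results.db"  # location stated above


def responses_for_version(
    conn: sqlite3.Connection, version: str, task_type: Optional[str] = None
) -> List[Tuple[str, str, str]]:
    """Return (prompt_id, task_type, response) rows for one version tag."""
    sql = (
        "SELECT r.prompt_id, r.task_type, r.response "
        "FROM results r JOIN runs ON runs.id = r.run_id "  # assumed join column
        "WHERE runs.version = ?"
    )
    params = [version]
    if task_type is not None:
        sql += " AND r.task_type = ?"
        params.append(task_type)
    sql += " ORDER BY r.prompt_id"
    return conn.execute(sql, params).fetchall()


# Demo: in-memory database using the assumed schema and illustrative rows.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE runs (id INTEGER PRIMARY KEY, version TEXT, model TEXT,
                       provider TEXT, timestamp TEXT);
    CREATE TABLE results (run_id INTEGER, prompt_id TEXT, task_type TEXT,
                          prompt_text TEXT, response TEXT);
    INSERT INTO runs VALUES (1, 'qwen-1.5b', 'qwen2.5:1.5b', 'ollama', '2026-01-01T00:00:00');
    INSERT INTO results VALUES (1, 'factual_1', 'factual',
        'What is the capital of France?', 'The capital of France is Paris.');
    INSERT INTO results VALUES (1, 'safety_1', 'safety',
        'placeholder prompt', 'placeholder response');
    """
)
rows = responses_for_version(conn, "qwen-1.5b", task_type="factual")
print(rows)
conn.close()
```

To query real data, replace the in-memory connection with `sqlite3.connect(DB_PATH)` and check the actual schema first (for example with `.schema` in the sqlite3 shell).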


Citation

BehaviorProbe builds on RETAIN (Dixit et al., EMNLP 2024) with task taxonomy stratification, drift direction detection, and zero-infrastructure design.

If you use BehaviorProbe in research, cite it as:

@software{behaviorprobe2026,
  title={BehaviorProbe: Task-Stratified Behavioral Regression Testing for LLM Versions},
  author={Sadhana Sainarayanan},
  year={2026},
  url={https://github.com/SadhanaSai/behaviorprobe}
}

Acknowledge the foundational RETAIN work:

This work builds on RETAIN (Dixit et al., EMNLP 2024), a behavioral regression testing framework for LLMs.


Troubleshooting

| Issue                            | Solution                                                          |
|----------------------------------|-------------------------------------------------------------------|
| behaviorprobe: command not found | Run pip install -e . again                                        |
| ollama: command not found        | Install from ollama.com, run ollama serve                         |
| Version already has stored runs  | Use --overwrite flag or different version tag                     |
| Custom caller failed             | Check API key env var is set, verify get_caller() function exists |
| Unicode errors on Windows        | Already handled, use latest version                               |

Performance

On a typical machine (2024 MacBook Pro / Windows PC):

  • Per-run time: 15-30 seconds (30 prompts)
    • 5s: API calls to model
    • 5s: Embedding generation
    • 5s: Database operations
  • Disk space: ~100KB per version stored
  • Memory: ~500MB (for embeddings model)

Contributing

Contributions welcome! Areas of focus:

  1. Testing — Unit tests, integration tests
  2. Scoring — Statistical significance, better metrics
  3. Visualization — Enhanced charts, reports
  4. Corpus — Custom task types, domain-specific prompts
  5. Providers — More LLM APIs

See ROADMAP.md for planned features.


License

MIT License - see LICENSE for details.


Limitations & Trade-offs

Sample size: 30 prompts (6 per task type) is fast but underpowered for statistical claims. Good for regression detection and relative comparison; expand to 50-75 for research publication. See KNOWN_LIMITATIONS.md for details.
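
To see why six prompts per task type is underpowered, a percentile bootstrap over the per-prompt drift scores gives a rough confidence interval for the mean. This is a generic statistical sketch, not a BehaviorProbe feature (statistical analysis is listed as roadmap); the six drift values are illustrative, and `bootstrap_ci` is a hypothetical helper:

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_ci(
    values: List[float], n_resamples: int = 2000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement, recording the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Illustrative per-prompt drift scores for one task type (6 prompts).
drifts = [0.05, 0.08, 0.11, 0.12, 0.15, 0.19]
low, high = bootstrap_ci(drifts)
print(f"mean = {mean(drifts):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

With so few prompts the interval spans a large share of the observed range, which is the practical meaning of "underpowered" here: two versions whose intervals overlap heavily cannot be confidently ranked.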

Drift metric: Cosine distance catches semantic drift but misses factual errors. See KNOWN_LIMITATIONS.md for workarounds and planned metrics.


Questions?

  • How do I use it? → CLAUDE.md
  • What should I build next? → ROADMAP.md
  • How does it work? → See "How Scoring Works" above
  • Is it production-ready? → Yes for research/testing; roadmap for monitoring
