ouvreboite/tool-design-benchmark
tool-design-benchmark

This repository is a small agentic tool-design benchmark built around a single use case: searching pets from a fake pet shelter.

The benchmark evaluates how different tool designs affect an LLM’s ability to retrieve the correct pet ids from a fixed dataset.

Requirements

  • jq and grep (for the jq, grep, and grep_and_jq strategies): must be installed and on PATH. On Windows, the benchmark uses Bash (e.g. from Git for Windows) if available; ensure jq and grep are installed in that environment (e.g. via Git Bash or WSL).

Quick start

pip install -r requirements.txt

Set your API key (PowerShell):

$env:OPENROUTER_API_KEY="..."
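On macOS/Linux (bash/zsh), the equivalent is:

```shell
export OPENROUTER_API_KEY="..."
```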

Run the benchmark:

python scripts/benchmark.py

The benchmark writes a timestamped results file in the repo root:

  • results_YYYYMMDD_HHMMSSffffff.jsonl

Models (OpenRouter)

Edit the model_ids = [...] list in scripts/benchmark.py to benchmark different models.

  • Values should be OpenRouter model IDs (the part after the openrouter: prefix), e.g. anthropic/claude-3.5-sonnet or openai/gpt-4o-mini.
  • The benchmark calls models via the openrouter: prefix, as described in PydanticAI's OpenRouter model documentation.

Pet dataset (data/animals.json)

The pets dataset is stored in:

  • data/animals.json

Each entry is an animal object with fields like:

  • id: integer
  • name: string
  • species: string (e.g. dog, cat, ...)
  • race: string or null (breed)
  • age: integer (years)
  • adopter: string or null (email)
  • shelter: string (city name)
  • status: available | on_hold | adopted | transferred
  • sex: male | female | unknown
  • temperaments: list of strings

Example:

{
  "id": 21,
  "name": "Willow",
  "species": "dog",
  "race": "labrador retriever",
  "age": 1,
  "adopter": null,
  "shelter": "Nixonville",
  "status": "available",
  "sex": "female",
  "temperaments": ["shy"]
}

Generated using:

  • python scripts/generate_animals.py data/animals.json --count 1000 --cities 20 --adopters 800
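For illustration, filtering records of this shape by case-insensitive equality on every provided field (the idea behind the per_field_exact_search strategy described below) can be sketched as follows. This is a simplified sketch, not the repo's actual code:

```python
# Sketch: return pets matching ALL provided filters (case-insensitive equality).
# Function and variable names here are illustrative.

def filter_pets(pets: list[dict], filters: dict) -> list[dict]:
    def matches(pet: dict) -> bool:
        return all(
            str(pet.get(field, "")).lower() == str(value).lower()
            for field, value in filters.items()
        )
    return [pet for pet in pets if matches(pet)]

pets = [
    {"id": 21, "name": "Willow", "species": "dog", "status": "available"},
    {"id": 22, "name": "Milo", "species": "cat", "status": "adopted"},
]
print(filter_pets(pets, {"species": "dog", "status": "available"}))  # → [Willow's record]
```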

Evaluation dataset (data/dataset.jsonl)

Evaluation prompts are stored in JSONL (one JSON object per line):

  • data/dataset.jsonl

Each line contains at least:

  • id
  • input (user question)
  • expected_pet_ids (ground-truth ids)

Example line:

{"id":1,"input":"I live next to South Leslieberg and I would like to adopt a young labrador","expected_pet_ids":[803]}

Tool design strategies

The benchmark runs the same dataset against multiple tool strategies.

Each strategy lives in its own file under:

  • scripts/tool_strategies/

Strategies currently implemented:

  • raw_data: SearchPets() returns the full pets dataset
  • paginated_raw_data: SearchPets(offset, count) returns a JSON slice of the dataset (count capped)
  • exact_text_search: SearchPets(query) returns matching pets by case-insensitive substring search over all values; if > 100 matches, raises ModelRetry
  • fuzzy_text_search: SearchPets(query) does exact substring for query length >= 3, otherwise RapidFuzz fuzzy matching; if > 100 matches, raises ModelRetry
  • fuzzy_text_search_fields: SearchPets(query, field=...) searches within a selected field (or All) with the same exact/fuzzy behavior; if > 100 matches, raises ModelRetry
  • per_field_exact_search: SearchPets(filters=...) returns pets that match ALL provided fields (case-insensitive equality); if > 100 matches, raises ModelRetry
  • per_field_fuzzy_search: SearchPets(filters=...) returns pets that match ALL provided fields using per-field fuzzy matching (hybrid exact/fuzzy); if > 100 matches, raises ModelRetry
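The hybrid exact/fuzzy matching used by the fuzzy strategies can be sketched as below. This sketch substitutes the standard library's difflib for RapidFuzz so it stays dependency-free; the 0.8 threshold and the function name are illustrative assumptions:

```python
from difflib import SequenceMatcher

def text_matches(query: str, value: str, threshold: float = 0.8) -> bool:
    """Exact substring match for queries of length >= 3, fuzzy otherwise.

    The repo uses RapidFuzz; SequenceMatcher stands in here as an
    assumption, and the 0.8 threshold is illustrative.
    """
    query, value = query.lower(), value.lower()
    if len(query) >= 3:
        return query in value
    return SequenceMatcher(None, query, value).ratio() >= threshold
```

For example, text_matches("lab", "labrador retriever") matches via exact substring, while a one- or two-character query falls through to the fuzzy ratio.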

To add a new strategy:

  • Create a new module in scripts/tool_strategies/ exposing register(agent, animals_path: Path) -> None
  • Add it to the strategies = [...] list in scripts/benchmark.py
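A new strategy module might look like the following sketch. tool_plain is PydanticAI's decorator for tools that don't need the run context; the tool body and file name are illustrative, not the repo's actual code:

```python
# scripts/tool_strategies/example_strategy.py (hypothetical)
import json
from pathlib import Path


def register(agent, animals_path: Path) -> None:
    """Load the dataset and attach a SearchPets tool to the agent."""
    animals = json.loads(animals_path.read_text(encoding="utf-8"))

    @agent.tool_plain
    def SearchPets(species: str) -> list[dict]:
        """Return pets whose species matches exactly (case-insensitive)."""
        return [a for a in animals if a["species"].lower() == species.lower()]
```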

Benchmark output

The results file contains one entry per (model, tool_strategy) with metrics like:

  • correct_pct: fraction of non-exception cases that matched expected_pet_ids (set equality)
  • exception_pct: fraction of cases that errored
  • tokens_avg: average total tokens per case
  • duration_avg: average wall-clock time per case (seconds)
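The aggregation behind these metrics can be sketched like this; the per-case record shape ("ok", "exception", "tokens", "duration") is an illustrative assumption, not the repo's actual schema:

```python
# Aggregate per-case results into summary metrics.
# correct_pct is computed over non-exception cases only, matching its
# definition above; the record field names are assumptions.

def summarize(cases: list[dict]) -> dict:
    non_exc = [c for c in cases if not c["exception"]]
    return {
        "correct_pct": sum(c["ok"] for c in non_exc) / len(non_exc) if non_exc else 0.0,
        "exception_pct": sum(c["exception"] for c in cases) / len(cases),
        "tokens_avg": sum(c["tokens"] for c in cases) / len(cases),
        "duration_avg": sum(c["duration"] for c in cases) / len(cases),
    }
```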

Questions

  1. Does the size of the input impact the accuracy?
  • models: gpt4.1 mini / gemini 2.5 flash
  • variant: 20/100/500/1000
  • strategy: json in system prompt
  2. Do better/larger models have better accuracy at larger input sizes?
  • models: from question 1 + gpt 5 mini / gpt5.2 / gemini 3 flash / gemini 3 pro
  • variant: 1000
  • strategy: json in system prompt
  3. Does the question language matter?
  • models: from question 1 + gpt 5 mini / gpt5.2 / gemini 3 flash / gemini 3 pro
  • variant: 1000
  • strategy: json in system prompt
  4. Does the data format matter?
  • models: the 2 best models from question 2
  • variant: 1000
  • strategy: json in system prompt, yaml in system prompt, markdown in system prompt, toon in system prompt

About

Benchmarking several tool design strategies for LLMs/MCP
