ouvreboite/tool-design-benchmark
tool-design-benchmark

This repository is a small agentic tool-design benchmark built around a single use case: searching pets from a fake pet shelter.

The benchmark evaluates how different tool designs affect an LLM’s ability to retrieve the correct pet ids from a fixed dataset.

Requirements

  • jq and grep (for the jq, grep, and grep_and_jq strategies): must be installed and on PATH. On Windows, the benchmark uses Bash (e.g. from Git for Windows) if available; ensure jq and grep are installed in that environment (e.g. via Git Bash or WSL).

Quick start

pip install -r requirements.txt

Set your API key (PowerShell):

$env:OPENROUTER_API_KEY="..."
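On macOS/Linux (bash/zsh), the equivalent is:

```shell
export OPENROUTER_API_KEY="..."
```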

Run the benchmark:

python scripts/benchmark.py

The benchmark writes a timestamped results file in the repo root:

  • results_YYYYMMDD_HHMMSSffffff.jsonl

Models (OpenRouter)

Edit the model_ids = [...] list in scripts/benchmark.py to benchmark different models.

  • Values should be OpenRouter model IDs (the part after the openrouter: prefix), e.g. anthropic/claude-3.5-sonnet or openai/gpt-4o-mini.
  • The benchmark calls models via the openrouter: prefix, as described in PydanticAI's OpenRouter model documentation.

Pet dataset (data/animals.json)

The pets dataset is stored in:

  • data/animals.json

Each entry is an animal object with fields like:

  • id: integer
  • name: string
  • species: string (e.g. dog, cat, ...)
  • race: string or null (breed)
  • age: integer (years)
  • adopter: string or null (email)
  • shelter: string (city name)
  • status: available | on_hold | adopted | transferred
  • sex: male | female | unknown
  • temperaments: list of strings

Example:

{
  "id": 21,
  "name": "Willow",
  "species": "dog",
  "race": "labrador retriever",
  "age": 1,
  "adopter": null,
  "shelter": "Nixonville",
  "status": "available",
  "sex": "female",
  "temperaments": ["shy"]
}

Generated using:

  • python scripts/generate_animals.py data/animals.json --count 1000 --cities 20 --adopters 800
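For illustration, filtering records of this shape by case-insensitive equality on every provided field (the idea behind the per_field_exact_search strategy described below) can be sketched as follows. This is a simplified sketch, not the repo's actual code:

```python
# Sketch: return pets matching ALL provided filters (case-insensitive equality).
# Function and variable names here are illustrative.

def filter_pets(pets: list[dict], filters: dict) -> list[dict]:
    def matches(pet: dict) -> bool:
        return all(
            str(pet.get(field, "")).lower() == str(value).lower()
            for field, value in filters.items()
        )
    return [pet for pet in pets if matches(pet)]

pets = [
    {"id": 21, "name": "Willow", "species": "dog", "status": "available"},
    {"id": 22, "name": "Milo", "species": "cat", "status": "adopted"},
]
print(filter_pets(pets, {"species": "dog", "status": "available"}))  # → [Willow's record]
```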

Evaluation dataset (data/dataset.jsonl)

Evaluation prompts are stored in JSONL (one JSON object per line):

  • data/dataset.jsonl

Each line contains at least:

  • id
  • input (user question)
  • expected_pet_ids (ground-truth ids)

Example line:

{"id":1,"input":"I live next to South Leslieberg and I would like to adopt a young labrador","expected_pet_ids":[803]}

Tool design strategies

The benchmark runs the same dataset against multiple tool strategies.

Each strategy lives in its own file under:

  • scripts/tool_strategies/

Strategies currently implemented:

  • raw_data: SearchPets() returns the full pets dataset
  • paginated_raw_data: SearchPets(offset, count) returns a JSON slice of the dataset (count capped)
  • exact_text_search: SearchPets(query) returns matching pets by case-insensitive substring search over all values; if > 100 matches, raises ModelRetry
  • fuzzy_text_search: SearchPets(query) does exact substring for query length >= 3, otherwise RapidFuzz fuzzy matching; if > 100 matches, raises ModelRetry
  • fuzzy_text_search_fields: SearchPets(query, field=...) searches within a selected field (or All) with the same exact/fuzzy behavior; if > 100 matches, raises ModelRetry
  • per_field_exact_search: SearchPets(filters=...) returns pets that match ALL provided fields (case-insensitive equality); if > 100 matches, raises ModelRetry
  • per_field_fuzzy_search: SearchPets(filters=...) returns pets that match ALL provided fields using per-field fuzzy matching (hybrid exact/fuzzy); if > 100 matches, raises ModelRetry
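The hybrid exact/fuzzy matching used by the fuzzy strategies can be sketched as below. This sketch substitutes the standard library's difflib for RapidFuzz so it stays dependency-free; the 0.8 threshold and the function name are illustrative assumptions:

```python
from difflib import SequenceMatcher

def text_matches(query: str, value: str, threshold: float = 0.8) -> bool:
    """Exact substring match for queries of length >= 3, fuzzy otherwise.

    The repo uses RapidFuzz; SequenceMatcher stands in here as an
    assumption, and the 0.8 threshold is illustrative.
    """
    query, value = query.lower(), value.lower()
    if len(query) >= 3:
        return query in value
    return SequenceMatcher(None, query, value).ratio() >= threshold
```

For example, text_matches("lab", "labrador retriever") matches via exact substring, while a one- or two-character query falls through to the fuzzy ratio.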

To add a new strategy:

  • Create a new module in scripts/tool_strategies/ exposing register(agent, animals_path: Path) -> None
  • Add it to the strategies = [...] list in scripts/benchmark.py
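A new strategy module might look like the following sketch. tool_plain is PydanticAI's decorator for tools that don't need the run context; the tool body and file name are illustrative, not the repo's actual code:

```python
# scripts/tool_strategies/example_strategy.py (hypothetical)
import json
from pathlib import Path


def register(agent, animals_path: Path) -> None:
    """Load the dataset and attach a SearchPets tool to the agent."""
    animals = json.loads(animals_path.read_text(encoding="utf-8"))

    @agent.tool_plain
    def SearchPets(species: str) -> list[dict]:
        """Return pets whose species matches exactly (case-insensitive)."""
        return [a for a in animals if a["species"].lower() == species.lower()]
```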

Benchmark output

The results file contains one entry per (model, tool_strategy) with metrics like:

  • correct_pct: fraction of non-exception cases that matched expected_pet_ids (set equality)
  • exception_pct: fraction of cases that errored
  • tokens_avg: average total tokens per case
  • duration_avg: average wall-clock time per case (seconds)
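The aggregation behind these metrics can be sketched like this; the per-case record shape ("ok", "exception", "tokens", "duration") is an illustrative assumption, not the repo's actual schema:

```python
# Aggregate per-case results into summary metrics.
# correct_pct is computed over non-exception cases only, matching its
# definition above; the record field names are assumptions.

def summarize(cases: list[dict]) -> dict:
    non_exc = [c for c in cases if not c["exception"]]
    return {
        "correct_pct": sum(c["ok"] for c in non_exc) / len(non_exc) if non_exc else 0.0,
        "exception_pct": sum(c["exception"] for c in cases) / len(cases),
        "tokens_avg": sum(c["tokens"] for c in cases) / len(cases),
        "duration_avg": sum(c["duration"] for c in cases) / len(cases),
    }
```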

Questions

  1. Does the size of the input impact the accuracy?
  • models: gpt4.1 mini / gemini 2.5 flash
  • variant: 20/100/500/1000
  • strategy: json in system prompt
  2. Do better/larger models have better accuracy at larger input sizes?
  • models: from question 1 + gpt 5 mini / gpt5.2 / gemini 3 flash / gemini 3 pro
  • variant: 1000
  • strategy: json in system prompt
  3. Does the question language matter?
  • models: from question 1 + gpt 5 mini / gpt5.2 / gemini 3 flash / gemini 3 pro
  • variant: 1000
  • strategy: json in system prompt
  4. Does the data format matter?
  • models: the 2 best models from question 2
  • variant: 1000
  • strategy: json in system prompt, yaml in system prompt, markdown in system prompt, toon in system prompt

About

Benchmarking several tool design strategies for LLMs/MCP
