This repository is a small agentic tool-design benchmark built around a single use case: searching pets from a fake pet shelter.
The benchmark evaluates how different tool designs affect an LLM’s ability to retrieve the correct pet ids from a fixed dataset.
- `jq` and `grep` (for the `jq`, `grep`, and `grep_and_jq` strategies): must be installed and on `PATH`. On Windows, the benchmark uses Bash (e.g. from Git for Windows) if available; ensure `jq` and `grep` are installed in that environment (e.g. via Git Bash or WSL).
Install dependencies:

```shell
pip install -r requirements.txt
```

Set your API key (PowerShell):

```powershell
$env:OPENROUTER_API_KEY="..."
```

Run the benchmark:

```shell
python scripts/benchmark.py
```

The benchmark writes a timestamped results file in the repo root:
`results_YYYYMMDD_HHMMSSffffff.jsonl`
Edit the `model_ids = [...]` list in `scripts/benchmark.py` to benchmark different models.
- Values should be OpenRouter model IDs (the part after the `openrouter:` prefix), e.g. `anthropic/claude-3.5-sonnet` or `openai/gpt-4o-mini`.
- The benchmark calls models via the `openrouter:` prefix as described in the PydanticAI docs: OpenRouter model docs.
The pets dataset is stored in:
`data/animals.json`
Each entry is an animal object with fields like:
- `id`: integer
- `name`: string
- `species`: string (e.g. `dog`, `cat`, ...)
- `race`: string or null (breed)
- `age`: integer (years)
- `adopter`: string or null (email)
- `shelter`: string (city name)
- `status`: `available | on_hold | adopted | transferred`
- `sex`: `male | female | unknown`
- `temperaments`: list of strings
Example:
```json
{
  "id": 21,
  "name": "Willow",
  "species": "dog",
  "race": "labrador retriever",
  "age": 1,
  "adopter": null,
  "shelter": "Nixonville",
  "status": "available",
  "sex": "female",
  "temperaments": ["shy"]
}
```

Generated using:

```shell
python scripts/generate_animals.py data/animals.json --count 1000 --cities 20 --adopters 800
```
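As a rough illustration of what such a generator does (this is a simplified stand-in, not the actual `scripts/generate_animals.py` — the real script's name pools and distributions are not shown here), a sketch using only the standard library:

```python
import json
import random

# Simplified vocabularies standing in for whatever the real generator uses.
SPECIES = {"dog": ["labrador retriever", "beagle"], "cat": ["siamese", None]}
STATUSES = ["available", "on_hold", "adopted", "transferred"]
TEMPERAMENTS = ["shy", "playful", "calm", "energetic"]

def generate_animals(count, cities, adopters, seed=0):
    """Produce `count` animal dicts matching the data/animals.json schema."""
    rng = random.Random(seed)
    city_names = [f"City{i}" for i in range(cities)]              # placeholder city pool
    adopter_pool = [f"user{i}@example.com" for i in range(adopters)]
    animals = []
    for i in range(1, count + 1):
        species = rng.choice(list(SPECIES))
        status = rng.choice(STATUSES)
        animals.append({
            "id": i,
            "name": f"Pet{i}",
            "species": species,
            "race": rng.choice(SPECIES[species]),
            "age": rng.randint(0, 15),
            # only adopted pets carry an adopter email
            "adopter": rng.choice(adopter_pool) if status == "adopted" else None,
            "shelter": rng.choice(city_names),
            "status": status,
            "sex": rng.choice(["male", "female", "unknown"]),
            "temperaments": rng.sample(TEMPERAMENTS, k=rng.randint(1, 2)),
        })
    return animals

if __name__ == "__main__":
    print(json.dumps(generate_animals(count=5, cities=2, adopters=3)[0], indent=2))
```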
Evaluation prompts are stored in JSONL (one JSON object per line):
`data/dataset.jsonl`
Each line contains at least:
- `id`
- `input` (user question)
- `expected_pet_ids` (ground-truth ids)
Example line:
```json
{"id":1,"input":"I live next to South Leslieberg and I would like to adopt a young labrador","expected_pet_ids":[803]}
```

The benchmark runs the same dataset against multiple tool strategies.
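A minimal sketch of reading this JSONL file (illustrative, not the repo's actual loader):

```python
import json
from pathlib import Path

def load_dataset(path):
    """Yield one evaluation case per non-empty JSONL line."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Each yielded case carries at least: id, input, expected_pet_ids.
```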
Each strategy lives in its own file under:
`scripts/tool_strategies/`
Strategies currently implemented:
- `raw_data`: `SearchPets()` returns the full pets dataset
- `paginated_raw_data`: `SearchPets(offset, count)` returns a JSON slice of the dataset (count capped)
- `exact_text_search`: `SearchPets(query)` returns matching pets by case-insensitive substring search over all values; if > 100 matches, raises `ModelRetry`
- `fuzzy_text_search`: `SearchPets(query)` does exact substring search for query length >= 3, otherwise RapidFuzz fuzzy matching; if > 100 matches, raises `ModelRetry`
- `fuzzy_text_search_fields`: `SearchPets(query, field=...)` searches within a selected field (or `All`) with the same exact/fuzzy behavior; if > 100 matches, raises `ModelRetry`
- `per_field_exact_search`: `SearchPets(filters=...)` returns pets that match ALL provided fields (case-insensitive equality); if > 100 matches, raises `ModelRetry`
- `per_field_fuzzy_search`: `SearchPets(filters=...)` returns pets that match ALL provided fields using per-field fuzzy matching (hybrid exact/fuzzy); if > 100 matches, raises `ModelRetry`
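To make the ALL-fields matching rule of `per_field_exact_search` concrete, here is a minimal sketch of that logic. Function names and the `ValueError` are illustrative (the real tool raises `ModelRetry` and lives in the repo's strategy module):

```python
def match_all_fields(animal, filters):
    """True iff the animal matches every provided filter (case-insensitive equality)."""
    for field, wanted in filters.items():
        value = animal.get(field)
        if isinstance(value, str) and isinstance(wanted, str):
            if value.lower() != wanted.lower():
                return False
        elif value != wanted:
            return False
    return True

def search_pets(animals, filters, limit=100):
    """Return pets matching ALL filters; refuse overly broad queries."""
    matches = [a for a in animals if match_all_fields(a, filters)]
    if len(matches) > limit:
        # The real tool raises ModelRetry here so the model narrows its query.
        raise ValueError(f"{len(matches)} matches; narrow the filters")
    return matches
```

The limit forces the model to refine broad queries instead of receiving a huge payload.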
To add a new strategy:
- Create a new module in `scripts/tool_strategies/` exposing `register(agent, animals_path: Path) -> None`.
- Add it to the `strategies = [...]` list in `scripts/benchmark.py`.
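A skeleton for such a module might look like the following. This is a sketch, not repo code: the `tool_plain` decorator assumes a PydanticAI-style agent, and the exact registration API may differ — check the existing strategy modules first.

```python
# Illustrative new strategy module for scripts/tool_strategies/.
import json
from pathlib import Path

def register(agent, animals_path: Path) -> None:
    """Load the dataset once and expose a SearchPets-style tool on the agent."""
    animals = json.loads(animals_path.read_text(encoding="utf-8"))

    @agent.tool_plain  # assumed PydanticAI-style decorator; adapt as needed
    def search_pets(query: str) -> list:
        """Case-insensitive substring search over all field values."""
        q = query.lower()
        return [a for a in animals
                if any(q in str(v).lower() for v in a.values())]
```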
The results file contains one entry per (model, tool_strategy) with metrics like:
- `correct_pct`: fraction of non-exception cases that matched `expected_pet_ids` (set equality)
- `exception_pct`: fraction of cases that errored
- `tokens_avg`: average total tokens per case
- `duration_avg`: average wall-clock time per case (seconds)
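The metrics above can be sketched as follows. The per-case field names (`expected`, `actual`, `exception`, `tokens`, `duration`) are illustrative; the repo's internal records may differ:

```python
def summarize(cases):
    """Aggregate per-case results into the benchmark's summary metrics.

    Each case is assumed to be a dict like:
      {"expected": [...], "actual": [...] or None,
       "exception": bool, "tokens": int, "duration": float}
    """
    n = len(cases)
    ok = [c for c in cases if not c["exception"]]
    # correctness is set equality against the ground-truth ids
    correct = [c for c in ok if set(c["actual"]) == set(c["expected"])]
    return {
        "correct_pct": len(correct) / len(ok) if ok else 0.0,
        "exception_pct": (n - len(ok)) / n if n else 0.0,
        "tokens_avg": sum(c["tokens"] for c in cases) / n,
        "duration_avg": sum(c["duration"] for c in cases) / n,
    }
```

Note that `correct_pct` is computed over non-exception cases only, while the token and duration averages cover every case.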
1. Does the size of the input impact accuracy?
   - models: gpt-4.1 mini / gemini 2.5 flash
   - variant: 20 / 100 / 500 / 1000
   - strategy: json in system prompt
2. Do better/larger models have better accuracy at larger input sizes?
   - models: from 1. + gpt 5 mini / gpt5.2 / gemini 3 flash / gemini 3 pro
   - variant: 1000
   - strategy: json in system prompt
3. Does the question language matter?
   - models: from 1. + gpt 5 mini / gpt5.2 / gemini 3 flash / gemini 3 pro
   - variant: 1000
   - strategy: json in system prompt
4. Does the data format matter?
   - models: the 2 best from 2.
   - variant: 1000
   - strategy: json in system prompt, yaml in system prompt, markdown in system prompt, toon in system prompt