TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning

🎯 Overview

TIR-Bench is a comprehensive benchmark for evaluating the "thinking-with-images" capabilities of Multimodal Large Language Models (MLLMs). It addresses a gap left by existing benchmarks such as Visual Search, which test only basic operations. As models like OpenAI o3 begin to create and operate tools that transform images during problem-solving, TIR-Bench provides 13 diverse tasks, each requiring novel tool use for image processing and manipulation within a chain of thought. Our evaluation of 22 leading MLLMs (open-source, proprietary, and tool-augmented) shows that TIR-Bench is challenging for every model tested and that strong performance requires genuine agentic thinking-with-images capabilities. This repository contains the full benchmark, the evaluation scripts, and a pilot study comparing direct and agentic fine-tuning for this kind of reasoning.

Download data

Please first download and extract images from https://huggingface.co/datasets/Agents-X/TIR-Bench.
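
For a scripted download, the sketch below shows one possible approach. It assumes the `huggingface_hub` package is installed and that the images ship as `.zip` archives inside the dataset repository; neither is stated in this README, so check the dataset page first. The function names `download_tir_bench` and `extract_archives` and the `TIR-Bench` target directory are hypothetical.

```python
import zipfile
from pathlib import Path


def download_tir_bench(local_dir: str = "TIR-Bench") -> Path:
    """Download the dataset repository from the Hugging Face Hub (needs network)."""
    # Imported here so extract_archives() stays usable without huggingface_hub.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id="Agents-X/TIR-Bench",
        repo_type="dataset",
        local_dir=local_dir,  # hypothetical target directory
    )
    return Path(path)


def extract_archives(root: Path) -> list[Path]:
    """Unpack every .zip found under root next to the archive itself.

    Assumption: the images are packaged as .zip files. Returns the archives handled.
    """
    archives = sorted(root.rglob("*.zip"))
    for archive in archives:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(archive.parent)
    return archives
```

A typical call would be `extract_archives(download_tir_bench())`. If the dataset uses a different archive format, the extraction step needs to change accordingly.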

Extract answers from model responses

Run the command below:

bash run_extract_answer.sh

Note that the response file should follow the structure below:

{
  '0': content, 
  '1': content
}
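
A minimal sketch of producing such a file is shown below. It assumes the response file is stored as JSON, that each key is the question's zero-based index written as a string, and that each value is the model's raw response text. None of this is confirmed here, so check `extract_answer.py` for the exact format and file path it expects. The helper names `build_responses` and `save_responses` are hypothetical.

```python
import json
from pathlib import Path


def build_responses(outputs):
    """Map each model output to its position, using string keys ('0', '1', ...)."""
    return {str(i): text for i, text in enumerate(outputs)}


def save_responses(outputs, path):
    """Write the responses to `path` as JSON and return the dict that was written."""
    responses = build_responses(outputs)
    Path(path).write_text(
        json.dumps(responses, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return responses
```

For example, `save_responses(model_outputs, "responses.json")` writes one entry per output in list order, so the list must already be ordered to match the benchmark's question indices.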

Calculate Score

Then run the command below:

bash run_calculate_score.sh

Citation

If you use this benchmark in your research, please consider citing it as follows:

@article{li2025tir,
  title={TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning},
  author={Li, Ming and Zhong, Jike and Zhao, Shitian and Zhang, Haoquan and Lin, Shaoheng and Lai, Yuxiang and Chen, Wei and Psounis, Konstantinos and Zhang, Kaipeng},
  journal={arXiv preprint arXiv:2511.01833},
  year={2025}
}

