TIR-Bench is a comprehensive benchmark designed to evaluate the "thinking-with-images" capabilities of Multimodal Large Language Models (MLLMs), addressing a gap left by existing benchmarks such as Visual Search, which test only basic operations. As models like OpenAI o3 begin to intelligently create and operate tools to transform images for problem-solving, TIR-Bench provides 13 diverse tasks that each require novel tool use for image processing and manipulation within a chain-of-thought. Our evaluation of 22 leading MLLMs (including open-source, proprietary, and tool-augmented models) shows that TIR-Bench is universally challenging and that strong performance requires genuine agentic thinking-with-images capabilities. This repository contains the full benchmark, evaluation scripts, and a pilot study comparing direct versus agentic fine-tuning for this advanced reasoning.
Please first download and extract images from https://huggingface.co/datasets/Agents-X/TIR-Bench.
Run the command below:
bash run_extract_answer.sh
Note that the response file should follow the structure below:
{
'0': content,
'1': content
}
Then run the command below:
bash run_calculate_score.sh
If you use this benchmark in your research, please consider citing it as follows:
@article{li2025tir,
title={TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning},
author={Li, Ming and Zhong, Jike and Zhao, Shitian and Zhang, Haoquan and Lin, Shaoheng and Lai, Yuxiang and Chen, Wei and Psounis, Konstantinos and Zhang, Kaipeng},
journal={arXiv preprint arXiv:2511.01833},
year={2025}
}
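As a practical note on the response file structure described in the usage steps, the following is a minimal sketch of how such a file might be produced from a list of model outputs. It rests on several assumptions not stated in this README: that the file is stored as JSON, that the keys are zero-based sample indices in dataset order, and that each value is the raw model response as a string. The helper names and the file name `responses.json` are hypothetical; check the evaluation scripts for the path and format they actually expect.

```python
import json
from pathlib import Path


def build_responses(outputs):
    """Map each model output to its position, using string keys ('0', '1', ...).

    Assumption: keys are zero-based indices in dataset order.
    """
    return {str(index): text for index, text in enumerate(outputs)}


def save_responses(outputs, path):
    """Write the response mapping to `path` as JSON and return the mapping.

    Assumption: the evaluation scripts read a JSON file; the path is hypothetical.
    """
    responses = build_responses(outputs)
    Path(path).write_text(
        json.dumps(responses, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return responses
```

For example, `save_responses(["The answer is B.", "There are 3 objects."], "responses.json")` would write a file with the keys `"0"` and `"1"`, mirroring the structure shown in the usage steps.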