agents-x-project/TIR-Bench

TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning

🎯Overview

TIR-Bench is a comprehensive benchmark for evaluating the "thinking-with-images" capabilities of Multimodal Large Language Models (MLLMs). It addresses a gap left by existing benchmarks such as Visual Search, which test only basic operations. As models like OpenAI o3 begin to intelligently create and operate tools that transform images for problem-solving, TIR-Bench provides 13 diverse tasks, each requiring novel tool use for image processing and manipulation within a chain-of-thought. Our evaluation of 22 leading MLLMs (open-source, proprietary, and tool-augmented) shows that TIR-Bench is universally challenging and that strong performance requires genuine agentic thinking-with-images capabilities. This repository contains the full benchmark, the evaluation scripts, and a pilot study comparing direct and agentic fine-tuning for this kind of reasoning.

Download data

Please first download and extract images from https://huggingface.co/datasets/Agents-X/TIR-Bench.
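
One possible way to script this step is sketched below. It assumes the `huggingface_hub` package is installed and that the images ship as `.zip` archives; neither is stated in this README. The directory name `TIR-Bench-data` and both helper functions are hypothetical.

```python
import zipfile
from pathlib import Path


def download_dataset(local_dir: str = "TIR-Bench-data") -> Path:
    """Fetch the dataset repository from the Hugging Face Hub."""
    # Imported lazily so the extraction helper works without this package.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id="Agents-X/TIR-Bench",
        repo_type="dataset",
        local_dir=local_dir,  # hypothetical target directory
    )
    return Path(path)


def extract_archives(root: Path) -> list[Path]:
    """Unpack every .zip found under root, next to the archive itself.

    Assumption: the images are packaged as .zip files.
    """
    extracted = []
    for archive in sorted(root.rglob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(archive.parent)
        extracted.append(archive)
    return extracted


if __name__ == "__main__":
    data_dir = download_dataset()
    for archive in extract_archives(data_dir):
        print(f"extracted {archive}")
```

If the archives use a different format, only `extract_archives` would need to change.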

Extract answers from model responses

Run the command below:

bash run_extract_answer.sh

Note that the response file should follow the structure below:

{
  '0': content, 
  '1': content
}
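
A minimal sketch of producing a file in this shape follows. It assumes the file is JSON (so keys and strings are double-quoted on disk), that keys are sample indices written as strings, and that each value is the raw model response. The file name `responses.json` and the function `save_responses` are hypothetical; check `run_extract_answer.sh` for the path the script actually reads.

```python
import json
from pathlib import Path


def save_responses(responses: list[str], path: str = "responses.json") -> dict[str, str]:
    """Write responses as {"0": ..., "1": ...} and return the mapping."""
    # Key each response by its position, stored as a string ("0", "1", ...).
    payload = {str(i): text for i, text in enumerate(responses)}
    Path(path).write_text(
        json.dumps(payload, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return payload


if __name__ == "__main__":
    # Placeholder responses for illustration only.
    save_responses(["The answer is B.", "There are 7 objects."])
```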

Calculate Score

Then run the command below:

bash run_calculate_score.sh

Citation

If you use this benchmark in your research, please consider citing it as follows:

@article{li2025tir,
  title={TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning},
  author={Li, Ming and Zhong, Jike and Zhao, Shitian and Zhang, Haoquan and Lin, Shaoheng and Lai, Yuxiang and Chen, Wei and Psounis, Konstantinos and Zhang, Kaipeng},
  journal={arXiv preprint arXiv:2511.01833},
  year={2025}
}
