🤖 AI Summary
Existing visual reasoning benchmarks inadequately assess models’ higher-order cognitive capabilities on dynamic, tool-dependent image manipulation tasks. To address this gap, we introduce TIR-Bench, a comprehensive benchmark for agentic thinking-with-images comprising 13 diverse visual tasks that require multi-step tool invocation, moving beyond conventional static-operation evaluations (e.g., localization or cropping). We propose a chain-of-reasoning framework powered by multimodal large language models (MLLMs) that explicitly models autonomous tool creation and iterative image-level operations. Systematic evaluation of 22 state-of-the-art multimodal models reveals consistently poor performance, confirming TIR-Bench’s difficulty and highlighting image-level thinking as both a critical bottleneck and a fundamental requirement for advanced visual reasoning.
📝 Abstract
The frontier of visual reasoning is shifting toward models like OpenAI o3, which can intelligently create and operate tools to transform images for problem-solving, also known as thinking-*with*-images in chain-of-thought. Yet existing benchmarks fail to fully capture this advanced capability. Even Visual Search, the most common benchmark for current thinking-*with*-images methods, tests only basic operations such as localization and cropping, offering little insight into more complex, dynamic, and tool-dependent reasoning. We introduce **TIR-Bench**, a comprehensive benchmark for evaluating agentic thinking-with-images across 13 diverse tasks, each requiring novel tool use for image processing and manipulation in chain-of-thought. We evaluate 22 multimodal large language models (MLLMs), from leading open-source and proprietary models to those with explicit tool-use augmentation. Results show that TIR-Bench is universally challenging, and that strong performance requires genuine thinking-with-images capabilities. Finally, we present a pilot study comparing direct versus agentic fine-tuning.