When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought

📅 2025-11-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the absence of visual intermediate representations in multimodal reasoning. We introduce MIRA, the first benchmark requiring models to generate intermediate visual images—such as sketches, structural diagrams, or path maps—to support complex reasoning. Unlike text-only chain-of-thought (CoT), MIRA mandates “thinking by drawing,” explicitly modeling spatial relationships and structural constraints that are difficult to express linguistically. We construct a multimodal evaluation set with human-annotated visual cues and ground-truth answers, design a three-level unified evaluation protocol, and propose the Visual-CoT input paradigm. Experiments show that current multimodal large language models (MLLMs) underperform with text-only prompting; incorporating generated visual intermediates yields an average relative performance gain of 33.7%. This demonstrates the critical role of visualized intermediate representations in enhancing complex reasoning and addresses the expressive limitations of conventional CoT.

📝 Abstract
We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images, such as sketches, structural diagrams, or path drawings, to guide their reasoning process. This setup closely mirrors how humans solve complex problems through "drawing to think". To this end, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone. To ensure that our evaluation data is of high quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority-voting accuracies under different k settings. Experimental results show that existing multimodal large language models, including the strongest proprietary models as well as strong open-weight models, perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.
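The pass@k and majority-voting numbers mentioned above can be computed as follows; this is a minimal sketch using the standard unbiased pass@k estimator, which the paper does not spell out, so the exact formulation is an assumption:

```python
from collections import Counter
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n generations is correct,
    given that c of the n generations are correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

def majority_vote(answers: list[str]) -> str:
    """Most frequent answer among the sampled generations; ties are
    broken by first occurrence (Counter preserves insertion order)."""
    return Counter(answers).most_common(1)[0][0]
```

For example, with n = 4 generations of which c = 2 are correct, `pass_at_k(4, 2, 2)` gives 5/6, and `majority_vote(["A", "B", "A"])` returns `"A"`.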
Problem

Research questions and friction points this paper is trying to address.

Evaluating models requiring intermediate visual generation for reasoning
Addressing tasks with complex structures beyond text-only expression
Benchmarking multimodal reasoning with annotated visual clues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generating intermediate visual images for reasoning
Using sketches and diagrams as thinking guides
Evaluating models with visual chain-of-thought inputs