🤖 AI Summary
Multimodal large language models (MLLMs) suffer from visual hallucinations and over-rely on textual priors during visual reasoning. Method: We propose a tool-augmented agent architecture that decouples high-level LLM reasoning from perception, delegating fine-grained visual analysis to a lightweight, specialized vision module and iterating via chain-of-thought guidance. We introduce a three-stage diagnostic evaluation framework that systematically uncovers failure modes of mainstream MLLMs, and design a modular, interpretable agent workflow supporting dynamic visual tool invocation and result verification. Contribution/Results: Our approach yields +10.3 and +6.0 absolute improvements on MMMU and MathVista, respectively, surpassing same-parameter-scale models and approaching the performance of significantly larger ones. The code and evaluation framework are publicly released.
📝 Abstract
Multimodal large language models (MLLMs) that integrate visual and textual reasoning leverage chain-of-thought (CoT) prompting to tackle complex visual tasks, yet they continue to exhibit visual hallucinations and an over-reliance on textual priors. We present a systematic diagnosis of state-of-the-art vision-language models using a three-stage evaluation framework, uncovering key failure modes. To address these failure modes, we propose an agent-based architecture that combines LLM reasoning with lightweight visual modules, enabling fine-grained analysis and iterative refinement of reasoning chains. Our results indicate that future visual reasoning models should integrate a broader set of specialized tools for analyzing visual content. Our system achieves significant gains (+10.3 on MMMU and +6.0 on MathVista over a 7B baseline), matching or surpassing much larger models. We will release our framework and evaluation suite to facilitate future research.
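The agent workflow described above — an LLM iteratively invoking visual tools, verifying each result, and appending verified evidence to its reasoning chain — can be sketched as a simple loop. This is a minimal illustration under assumed interfaces: the tool names, the `policy`, and the `verify` check are all hypothetical stand-ins, not the paper's actual components.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical sketch of a tool-augmented agent loop with result
# verification. All names here are illustrative assumptions.

@dataclass
class Step:
    tool: str
    args: dict
    result: str

@dataclass
class AgentState:
    question: str
    chain: list = field(default_factory=list)  # interpretable reasoning trace
    answer: str | None = None

def run_agent(question: str,
              policy: Callable,      # picks a (tool, args) pair or a final answer
              tools: dict,           # name -> callable visual tool
              verify: Callable,      # accepts/rejects a tool result
              max_steps: int = 5) -> AgentState:
    """Iteratively let the policy invoke a visual tool, verify its output,
    and append it to the reasoning chain until a final answer is emitted."""
    state = AgentState(question)
    for _ in range(max_steps):
        action = policy(state)
        if isinstance(action, str):   # policy emits the final answer
            state.answer = action
            return state
        tool_name, args = action
        result = tools[tool_name](**args)
        # Only verified results enter the chain; rejections are logged,
        # keeping the trace inspectable.
        state.chain.append(Step(tool_name, args,
                                result if verify(result) else "REJECTED"))
    return state

# Toy usage with stubbed components standing in for the LLM and an OCR tool.
tools = {"ocr": lambda region: f"text in {region}"}

def policy(state: AgentState):
    if not state.chain:
        return ("ocr", {"region": "chart axis"})
    return f"answer based on: {state.chain[-1].result}"

state = run_agent("What does the axis label say?", policy, tools,
                  verify=lambda r: not r.startswith("ERROR"))
print(state.answer)  # → "answer based on: text in chart axis"
```

In this shape, the LLM policy stays fully decoupled from perception: swapping in additional specialized tools (detection, OCR, chart parsing) only extends the `tools` dictionary, leaving the reasoning loop unchanged.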