More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper identifies a critical failure mode in multimodal large language models (MLLMs): as generated reasoning chains grow longer, visual grounding degrades and hallucination increases, with models relying more heavily on linguistic priors at the expense of image perception. To study this, the authors propose RH-AUC, a metric that quantifies how perception accuracy changes with reasoning length, and release RH-Bench, a diagnostic benchmark that, for the first time, quantitatively characterizes the trade-off between reasoning ability and hallucination. Through attention analysis, visual grounding quantification, and multi-task joint evaluation, they find that the types and domains of training data matter more than its overall volume, and that larger models typically strike a better balance between reasoning depth and visual fidelity. The work provides both a quantitative diagnostic toolkit and actionable design principles for more reliable multimodal reasoning.

📝 Abstract
Test-time compute has empowered multimodal large language models to generate extended reasoning chains, yielding strong performance on tasks such as multimodal math reasoning. However, this improved reasoning ability often comes with increased hallucination: as generations become longer, models tend to drift away from image-grounded content and rely more heavily on language priors. Attention analysis shows that longer reasoning chains lead to reduced focus on visual inputs, which contributes to hallucination. To systematically study this phenomenon, we introduce RH-AUC, a metric that quantifies how a model's perception accuracy changes with reasoning length, allowing us to evaluate whether the model preserves visual grounding during reasoning. We also release RH-Bench, a diagnostic benchmark that spans a variety of multimodal tasks, designed to assess the trade-off between reasoning ability and hallucination. Our analysis reveals that (i) larger models typically achieve a better balance between reasoning and perception, and (ii) this balance is influenced more by the types and domains of training data than by its overall volume. These findings underscore the importance of evaluation frameworks that jointly consider both reasoning quality and perceptual fidelity.
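The abstract describes RH-AUC as a metric quantifying how perception accuracy changes with reasoning length. The paper's exact formula is not given in this summary, so the sketch below is only an assumption of the general idea: a trapezoidal area under the accuracy curve over reasoning lengths normalized to [0, 1], so that a model whose perception stays accurate across all chain lengths scores close to 1.0. The function name `rh_auc` and its inputs are hypothetical.

```python
import numpy as np

def rh_auc(reasoning_lengths, perception_accuracies):
    """Sketch of an RH-AUC-style score (assumed form, not the paper's
    exact definition): trapezoidal area under perception accuracy as a
    function of reasoning length, with lengths rescaled to [0, 1]."""
    lengths = np.asarray(reasoning_lengths, dtype=float)
    accs = np.asarray(perception_accuracies, dtype=float)
    order = np.argsort(lengths)
    lengths, accs = lengths[order], accs[order]
    # Normalize reasoning length to [0, 1] so scores are comparable
    # across models with different maximum chain lengths.
    x = (lengths - lengths[0]) / (lengths[-1] - lengths[0])
    # Trapezoidal rule, written out to avoid version-specific NumPy APIs.
    return float(np.sum((x[1:] - x[:-1]) * (accs[1:] + accs[:-1]) / 2.0))
```

Under this reading, a model whose perception accuracy drops sharply as chains lengthen (the hallucination drift the paper describes) gets a lower score than one whose accuracy holds steady, which is exactly the trade-off the benchmark is meant to expose.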
Problem

Research questions and friction points this paper is trying to address.

Assessing hallucination in multimodal reasoning models
Evaluating visual grounding during extended reasoning chains
Balancing reasoning ability and perceptual fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces RH-AUC, a metric quantifying how perception accuracy varies with reasoning length
Releases RH-Bench, a diagnostic benchmark spanning diverse multimodal tasks
Shows the reasoning-perception balance depends more on training-data types and domains than on volume