MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal benchmarks inadequately distinguish perceptual hallucinations from reasoning hallucinations in Multimodal Large Language Models (MLLMs), hindering precise diagnosis of multimodal reasoning failures. To address this, we introduce MIRAGE—the first benchmark explicitly designed to isolate reasoning hallucinations—by rigorously ensuring image perception correctness while deliberately injecting only logical errors. We propose novel multi-granularity evaluation metrics (accuracy, factuality, and LLM-hallucination score) and, for the first time, systematically uncover correlations among model scale, training stage, and hallucination type. Furthermore, we introduce curriculum-based reinforcement fine-tuning and collaborative prompting for reasoning. Experiments demonstrate that our methods significantly reduce logical hallucinations in base MLLMs; reveal that current MLLMs exhibit no substantial improvement in spatial relational reasoning; and establish the first reproducible baseline on MIRAGE.

📝 Abstract
Multimodal hallucination limits the correctness of multimodal large language models (MLLMs). These hallucinations are multi-sourced and arise from diverse causes, yet existing benchmarks fail to adequately distinguish perception-induced from reasoning-induced hallucinations. This gap hinders the diagnosis of multimodal reasoning failures within MLLMs. To address it, we propose the MIRAGE benchmark, which isolates reasoning hallucinations by constructing questions where input images are correctly perceived by MLLMs yet reasoning errors persist. MIRAGE introduces multi-granular evaluation metrics (accuracy, factuality, and an LLM hallucination score) for hallucination quantification. Our analysis reveals that (1) model scale, data scale, and training stage significantly affect the degree of logical, fabrication, and factual hallucinations; (2) current MLLMs show no effective improvement on spatial hallucinations caused by misinterpreted spatial relationships, indicating limited visual reasoning capabilities; and (3) question types correlate with distinct hallucination patterns, highlighting targeted challenges and potential mitigation strategies. To address these challenges, we propose {method}, which combines curriculum reinforcement fine-tuning, encouraging models to generate logic-consistent reasoning chains by stepwise reducing learning difficulty, with collaborative hint inference, which reduces reasoning complexity. {method} establishes a baseline on MIRAGE and reduces the logical hallucinations of the original base models.
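The abstract names three multi-granular metrics (accuracy, factuality, and an LLM hallucination score) without spelling out their computation. A minimal sketch of how such metrics might be aggregated over a benchmark is shown below; the `Sample` fields and the averaging scheme are illustrative assumptions, not the paper's exact protocol.

```python
# Hypothetical aggregation of multi-granularity hallucination metrics.
# Field names and formulas are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class Sample:
    answer_correct: bool       # final answer matches ground truth
    steps_factual: list[bool]  # per-step factuality judgments of the reasoning chain
    judge_score: float         # LLM-judge hallucination score in [0, 1]

def evaluate(samples: list[Sample]) -> dict[str, float]:
    n = len(samples)
    # accuracy: fraction of samples with a correct final answer
    accuracy = sum(s.answer_correct for s in samples) / n
    # factuality: per-sample fraction of factual reasoning steps, averaged
    factuality = sum(
        sum(s.steps_factual) / len(s.steps_factual) for s in samples
    ) / n
    # LLM hallucination score: mean judge score across samples
    halluc = sum(s.judge_score for s in samples) / n
    return {"accuracy": accuracy,
            "factuality": factuality,
            "llm_hallucination_score": halluc}
```

A sample can thus have a correct final answer yet a low factuality score, which is exactly the perception-vs-reasoning separation the benchmark targets.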
Problem

Research questions and friction points this paper is trying to address.

Distinguishing perception-induced from reasoning-induced hallucinations in MLLMs
Evaluating model scale and data impact on hallucination types
Improving visual reasoning to reduce spatial misinterpretation errors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Isolates reasoning hallucinations via benchmark
Uses multi-granular metrics for quantification
Combines curriculum fine-tuning and hint inference
Bowen Dong (Harbin Institute of Technology; The Hong Kong Polytechnic University)
Minheng Ni (The Hong Kong Polytechnic University)
Zitong Huang (Harbin Institute of Technology)
Guanglei Yang (Harbin Institute of Technology)
Wangmeng Zuo (School of Computer Science and Technology, Harbin Institute of Technology)
Lei Zhang (The Hong Kong Polytechnic University)