MIHBench: Benchmarking and Mitigating Multi-Image Hallucinations in Multimodal Large Language Models

📅 2025-08-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior work predominantly addresses hallucination in single-image understanding, leaving multi-image hallucination—particularly concerning object existence, count accuracy, and cross-view identity consistency—largely unexplored and unbenchmarked. Method: We introduce MIHBench, the first dedicated benchmark for multi-image object hallucination, comprising three semantic reasoning tasks. Leveraging MIHBench, we systematically analyze key factors influencing hallucination—including image count, per-image hallucination propensity, proportion of same-object images, and negative sample placement—and propose a Dynamic Attention Balancing mechanism that adjusts inter-image attention distributions while preserving the overall visual attention proportion, supporting semantic consistency across images and count-aware reasoning. Contribution/Results: Evaluated on multiple state-of-the-art multimodal large language models, our approach significantly reduces hallucination rates, enhances cross-image semantic integration, and improves reasoning robustness. MIHBench establishes a reproducible evaluation framework and an optimization paradigm for multi-image understanding.

📝 Abstract
Despite growing interest in hallucination in Multimodal Large Language Models, existing studies primarily focus on single-image settings, leaving hallucination in multi-image scenarios largely unexplored. To address this gap, we conduct the first systematic study of hallucinations in multi-image MLLMs and propose MIHBench, a benchmark specifically tailored for evaluating object-related hallucinations across multiple images. MIHBench comprises three core tasks: Multi-Image Object Existence Hallucination, Multi-Image Object Count Hallucination, and Object Identity Consistency Hallucination, targeting semantic understanding across object existence, quantity reasoning, and cross-view identity consistency. Through extensive evaluation, we identify key factors associated with the occurrence of multi-image hallucinations, including: a progressive relationship between the number of image inputs and the likelihood of hallucination occurrences; a strong correlation between single-image hallucination tendencies and those observed in multi-image contexts; and the influence of same-object image ratios and the positional placement of negative samples within image sequences on the occurrence of object identity consistency hallucination. To address these challenges, we propose a Dynamic Attention Balancing mechanism that adjusts inter-image attention distributions while preserving the overall visual attention proportion. Experiments across multiple state-of-the-art MLLMs demonstrate that our method effectively reduces hallucination occurrences and enhances semantic integration and reasoning stability in multi-image scenarios.
Problem

Research questions and friction points this paper is trying to address.

Study multi-image hallucinations in multimodal large language models
Benchmark object-related hallucinations across multiple images
Mitigate hallucinations via Dynamic Attention Balancing mechanism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Attention Balancing adjusts inter-image attention distributions while preserving the overall visual attention proportion
MIHBench evaluates multi-image object hallucinations across existence, count, and identity consistency tasks
Analyzes how image count, single-image hallucination tendency, same-object image ratio, and negative sample placement affect hallucination likelihood
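The page describes Dynamic Attention Balancing only at a high level (adjusting inter-image attention while preserving the overall visual attention proportion). A minimal sketch of one way such rebalancing could work, purely illustrative: the function name, the span representation, and the linear interpolation toward a uniform per-image share are all assumptions, not the paper's actual implementation.

```python
import numpy as np

def balance_image_attention(attn, image_spans, alpha=0.5):
    """Rebalance attention mass across image token groups.

    attn: 1-D array of attention weights over all tokens (sums to 1).
    image_spans: list of (start, end) token index ranges, one per image.
    alpha: interpolation factor between the original per-image shares
           (alpha=0) and a uniform share per image (alpha=1).

    The total attention mass assigned to visual tokens is preserved;
    only its distribution across images changes. Hypothetical sketch.
    """
    attn = attn.astype(float).copy()
    # Current attention mass each image receives.
    shares = np.array([attn[s:e].sum() for s, e in image_spans])
    total_visual = shares.sum()
    # Target: interpolate each image's share toward the uniform split.
    uniform = np.full(len(image_spans), total_visual / len(image_spans))
    target = (1 - alpha) * shares + alpha * uniform
    # Rescale tokens within each image to hit the target share.
    for (s, e), old, new in zip(image_spans, shares, target):
        if old > 0:
            attn[s:e] *= new / old
    return attn
```

With alpha=1 every image ends up with an equal share of the visual attention mass, while text-token attention and the visual/text split are untouched; intermediate alpha values soften the imbalance rather than erase it.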
Jiale Li
Xiamen University, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen, Fujian, China
Mingrui Wu
XMU
Zixiang Jin
Xiamen University, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen, Fujian, China
Hao Chen
Xiamen University, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen, Fujian, China
Jiayi Ji
Rutgers University
Xiaoshuai Sun
Xiamen University, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen, Fujian, China
Liujuan Cao
Xiamen University, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen, Fujian, China
Rongrong Ji
Xiamen University, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen, Fujian, China