MIHBench: Benchmarking and Mitigating Multi-Image Hallucinations in Multimodal Large Language Models

📅 2025-08-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior work predominantly addresses hallucination in single-image understanding, leaving multi-image hallucination—particularly concerning object existence, count accuracy, and cross-view identity consistency—largely unexplored and unbenchmarked. Method: We introduce MIHBench, the first dedicated benchmark for multi-image object hallucination, comprising three semantic reasoning tasks. Leveraging MIHBench, we systematically analyze key factors influencing hallucination—including image count, per-image hallucination propensity, proportion of same-object images, and negative sample placement—and propose a Dynamic Attention Balancing mechanism that adjusts inter-image attention distributions while preserving the overall visual attention proportion, supporting semantic consistency across images and count-aware reasoning. Contribution/Results: Evaluated on multiple state-of-the-art multimodal large language models, our approach significantly reduces hallucination rates, enhances cross-image semantic integration, and improves reasoning robustness. MIHBench establishes a reproducible evaluation framework and an optimization paradigm for multi-image understanding.

📝 Abstract
Despite growing interest in hallucination in Multimodal Large Language Models, existing studies primarily focus on single-image settings, leaving hallucination in multi-image scenarios largely unexplored. To address this gap, we conduct the first systematic study of hallucinations in multi-image MLLMs and propose MIHBench, a benchmark specifically tailored for evaluating object-related hallucinations across multiple images. MIHBench comprises three core tasks: Multi-Image Object Existence Hallucination, Multi-Image Object Count Hallucination, and Object Identity Consistency Hallucination, targeting semantic understanding across object existence, quantity reasoning, and cross-view identity consistency. Through extensive evaluation, we identify key factors associated with the occurrence of multi-image hallucinations, including: a progressive relationship between the number of image inputs and the likelihood of hallucination occurrences; a strong correlation between single-image hallucination tendencies and those observed in multi-image contexts; and the influence of same-object image ratios and the positional placement of negative samples within image sequences on the occurrence of object identity consistency hallucination. To address these challenges, we propose a Dynamic Attention Balancing mechanism that adjusts inter-image attention distributions while preserving the overall visual attention proportion. Experiments across multiple state-of-the-art MLLMs demonstrate that our method effectively reduces hallucination occurrences and enhances semantic integration and reasoning stability in multi-image scenarios.
Problem

Research questions and friction points this paper is trying to address.

Study multi-image hallucinations in multimodal large language models
Benchmark object-related hallucinations across multiple images
Mitigate hallucinations via Dynamic Attention Balancing mechanism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Attention Balancing adjusts inter-image attention distributions while preserving the overall visual attention proportion
MIHBench evaluates multi-image object hallucinations across existence, count, and identity consistency tasks
Analyzes how image count, single-image hallucination tendency, same-object image ratio, and negative sample placement affect hallucination likelihood
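The page describes Dynamic Attention Balancing only at a high level (adjusting inter-image attention while preserving the overall visual attention proportion). A minimal sketch of one way such rebalancing could work, purely illustrative: the function name, the span representation, and the linear interpolation toward a uniform per-image share are all assumptions, not the paper's actual implementation.

```python
import numpy as np

def balance_image_attention(attn, image_spans, alpha=0.5):
    """Rebalance attention mass across image token groups.

    attn: 1-D array of attention weights over all tokens (sums to 1).
    image_spans: list of (start, end) token index ranges, one per image.
    alpha: interpolation factor between the original per-image shares
           (alpha=0) and a uniform share per image (alpha=1).

    The total attention mass assigned to visual tokens is preserved;
    only its distribution across images changes. Hypothetical sketch.
    """
    attn = attn.astype(float).copy()
    # Current attention mass each image receives.
    shares = np.array([attn[s:e].sum() for s, e in image_spans])
    total_visual = shares.sum()
    # Target: interpolate each image's share toward the uniform split.
    uniform = np.full(len(image_spans), total_visual / len(image_spans))
    target = (1 - alpha) * shares + alpha * uniform
    # Rescale tokens within each image to hit the target share.
    for (s, e), old, new in zip(image_spans, shares, target):
        if old > 0:
            attn[s:e] *= new / old
    return attn
```

With alpha=1 every image ends up with an equal share of the visual attention mass, while text-token attention and the visual/text split are untouched; intermediate alpha values soften the imbalance rather than erase it.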
Jiale Li
Xiamen University, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen, Fujian, China
Mingrui Wu
XMU
Zixiang Jin
Xiamen University, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen, Fujian, China
Hao Chen
Xiamen University, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen, Fujian, China
Jiayi Ji
Rutgers University
Xiaoshuai Sun
Xiamen University, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen, Fujian, China
Liujuan Cao
Xiamen University, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen, Fujian, China
Rongrong Ji
Xiamen University, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen, Fujian, China