Revealing Multi-View Hallucination in Large Vision-Language Models

📅 2026-03-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses multi-view hallucination in large vision-language models: when given multiple images, models often confuse distinct instances or mismatch information across viewpoints. The study gives the first systematic definition and evaluation of the issue, introducing MVH-Bench, a benchmark of 4.8k question-answer pairs covering both cross-instance and cross-view hallucinations. To mitigate the problem without additional training, the authors propose Reference Shift Contrastive Decoding (RSCD), a decoding strategy that uses attention masks to generate negative logits, thereby suppressing visual interference from irrelevant references. Experiments show that RSCD substantially alleviates multi-view hallucination, improving accuracy by up to 21.1 and 34.6 points over existing methods on Qwen2.5-VL and LLaVA-OneVision, respectively.

πŸ“ Abstract
Large vision-language models (LVLMs) are increasingly being applied to multi-view image inputs captured from diverse viewpoints. However, despite this growing use, current LVLMs often confuse or mismatch visual information originating from different instances or viewpoints, a phenomenon we term multi-view hallucination. To systematically analyze this problem, we construct MVH-Bench, a benchmark comprising 4.8k question-answer pairs targeting two types of hallucination: cross-instance and cross-view. Empirical results show that recent LVLMs struggle to correctly associate visual evidence with its corresponding instance or viewpoint. To overcome this limitation, we propose Reference Shift Contrastive Decoding (RSCD), a training-free decoding technique that suppresses visual interference by generating negative logits through attention masking. Experiments on MVH-Bench with Qwen2.5-VL and LLaVA-OneVision demonstrate that RSCD consistently improves performance by up to 21.1 and 34.6 points over existing hallucination mitigation methods, highlighting the effectiveness of our approach.
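The contrastive-decoding idea described in the abstract can be illustrated with a generic sketch: the model is run twice, once normally and once with attention to the irrelevant reference views masked out, and the second (negative) set of logits is subtracted from the first before sampling. The function below is a minimal illustration of this family of methods, not the paper's implementation; the hyperparameters `alpha` and `beta` and the plausibility cutoff are hypothetical choices, and the exact attention-masking scheme used by RSCD is defined in the paper.

```python
import numpy as np

def contrastive_decode_step(logits_full, logits_masked, alpha=1.0, beta=0.1):
    """One greedy step of a generic contrastive-decoding scheme.

    logits_full   : logits from the model attending to all views (positive branch).
    logits_masked : logits from the same model with attention to the irrelevant
                    reference views masked (negative branch).
    alpha, beta   : illustrative hyperparameters, not taken from the paper.
    """
    # Adaptive plausibility constraint: restrict the candidate set to tokens
    # the full model itself considers reasonably likely.
    probs_full = np.exp(logits_full - logits_full.max())
    probs_full /= probs_full.sum()
    plausible = probs_full >= beta * probs_full.max()

    # Contrast the two branches: amplify what the full view supports and
    # penalize what survives even when the relevant evidence is masked,
    # suppressing interference from the irrelevant references.
    contrast = (1 + alpha) * logits_full - alpha * logits_masked
    contrast = np.where(plausible, contrast, -np.inf)
    return int(np.argmax(contrast))
```

For example, if the masked (negative) branch strongly prefers a token, the subtraction demotes it even when the full branch ranks it slightly highest, so the prediction shifts toward tokens grounded in the relevant view.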
Problem

Research questions and friction points this paper is trying to address.

multi-view hallucination
large vision-language models
cross-instance hallucination
cross-view hallucination
visual information mismatch
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-view hallucination
MVH-Bench
Reference Shift Contrastive Decoding
vision-language models
training-free decoding