AI Summary
This work addresses the problem of multi-view hallucination in large vision-language models, where models often confuse distinct instances or perspectives across multiple views. The study presents the first systematic definition and evaluation of this issue, introducing MVH-Bench, a benchmark comprising 4.8k question-answer pairs that cover both cross-instance and cross-view hallucinations. To mitigate this challenge without additional training, the authors propose Reference Shift Contrastive Decoding (RSCD), a decoding strategy that leverages attention masks to generate negative logits, thereby suppressing visual interference from irrelevant references. Experiments demonstrate that RSCD significantly alleviates multi-view hallucination, achieving absolute accuracy improvements of up to 21.1 and 34.6 percentage points over existing methods on Qwen2.5-VL and LLaVA-OneVision, respectively.
Abstract
Large vision-language models (LVLMs) are increasingly applied to multi-view image inputs captured from diverse viewpoints. Despite this growing use, however, current LVLMs often confuse or mismatch visual information originating from different instances or viewpoints, a phenomenon we term multi-view hallucination. To systematically analyze this problem, we construct MVH-Bench, a benchmark comprising 4.8k question-answer pairs targeting two types of hallucination: cross-instance and cross-view. Empirical results show that recent LVLMs struggle to correctly associate visual evidence with its corresponding instance or viewpoint. To overcome this limitation, we propose Reference Shift Contrastive Decoding (RSCD), a training-free decoding technique that suppresses visual interference by generating negative logits through attention masking. Experiments on MVH-Bench with Qwen2.5-VL and LLaVA-OneVision demonstrate that RSCD consistently improves performance by up to 21.1 and 34.6 points over existing hallucination mitigation methods, highlighting the effectiveness of our approach.
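To make the decoding idea concrete, the following is a minimal sketch of the contrastive step such a method could use at each generation step: one forward pass with all views attended yields the full-context logits, a second pass with the irrelevant reference attention-masked yields "negative" logits, and the two are combined to suppress interference. The combination rule `(1 + alpha) * full - alpha * negative` and the hyperparameter `alpha` follow the standard contrastive-decoding form and are assumptions here, not necessarily the paper's exact formulation; the toy logit values are illustrative only.

```python
import numpy as np

def rscd_logits(logits_full, logits_negative, alpha=1.0):
    """Contrastive decoding step (assumed standard form): amplify the
    full-context logits and subtract the negative logits obtained from
    the attention-masked forward pass."""
    return (1.0 + alpha) * logits_full - alpha * logits_negative

# Toy 3-token vocabulary. Token 1 is a hallucinated answer whose score is
# inflated by interference from an irrelevant reference view.
logits_full = np.array([2.0, 2.1, 0.5])      # pass 1: all views attended
logits_negative = np.array([0.3, 2.0, 0.4])  # pass 2: interference dominates

combined = rscd_logits(logits_full, logits_negative, alpha=1.0)

print(int(np.argmax(logits_full)))  # greedy pick without the contrastive step
print(int(np.argmax(combined)))     # greedy pick after subtracting interference
```

In this toy case the plain greedy pick selects the interference-driven token, while the contrastive combination restores the correct one, which is the intended effect of subtracting the masked-pass logits.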