AI Summary
This work addresses the problem of multi-view hallucination in large vision-language models, where models often confuse distinct instances or perspectives across multiple views. The study presents the first systematic definition and evaluation of this issue, introducing MVH-Bench, a benchmark comprising 4.8k question-answer pairs that cover both cross-instance and cross-view hallucinations. To mitigate this challenge without additional training, the authors propose Reference Shift Contrastive Decoding (RSCD), a decoding strategy that leverages attention masks to generate negative logits, thereby suppressing visual interference from irrelevant references. Experiments demonstrate that RSCD significantly alleviates multi-view hallucination, achieving absolute accuracy improvements of up to 21.1 and 34.6 percentage points over existing methods on Qwen2.5-VL and LLaVA-OneVision, respectively.
Abstract
Large vision-language models (LVLMs) are increasingly applied to multi-view image inputs captured from diverse viewpoints. Despite this growing use, however, current LVLMs often confuse or mismatch visual information originating from different instances or viewpoints, a phenomenon we term multi-view hallucination. To systematically analyze this problem, we construct MVH-Bench, a benchmark comprising 4.8k question-answer pairs targeting two types of hallucination: cross-instance and cross-view. Empirical results show that recent LVLMs struggle to correctly associate visual evidence with its corresponding instance or viewpoint. To overcome this limitation, we propose Reference Shift Contrastive Decoding (RSCD), a training-free decoding technique that suppresses visual interference by generating negative logits through attention masking. Experiments on MVH-Bench with Qwen2.5-VL and LLaVA-OneVision demonstrate that RSCD consistently improves performance by up to 21.1 and 34.6 points over existing hallucination mitigation methods, highlighting the effectiveness of our approach.
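To make the decoding idea concrete, the following is a minimal sketch of the contrastive step such a method could use at each generation step: one forward pass with all views attended yields the full-context logits, a second pass with the irrelevant reference attention-masked yields "negative" logits, and the two are combined to suppress interference. The combination rule `(1 + alpha) * full - alpha * negative` and the hyperparameter `alpha` follow the standard contrastive-decoding form and are assumptions here, not necessarily the paper's exact formulation; the toy logit values are illustrative only.

```python
import numpy as np

def rscd_logits(logits_full, logits_negative, alpha=1.0):
    """Contrastive decoding step (assumed standard form): amplify the
    full-context logits and subtract the negative logits obtained from
    the attention-masked forward pass."""
    return (1.0 + alpha) * logits_full - alpha * logits_negative

# Toy 3-token vocabulary. Token 1 is a hallucinated answer whose score is
# inflated by interference from an irrelevant reference view.
logits_full = np.array([2.0, 2.1, 0.5])      # pass 1: all views attended
logits_negative = np.array([0.3, 2.0, 0.4])  # pass 2: interference dominates

combined = rscd_logits(logits_full, logits_negative, alpha=1.0)

print(int(np.argmax(logits_full)))  # greedy pick without the contrastive step
print(int(np.argmax(combined)))     # greedy pick after subtracting interference
```

In this toy case the plain greedy pick selects the interference-driven token, while the contrastive combination restores the correct one, which is the intended effect of subtracting the masked-pass logits.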