Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) suffer from hallucination primarily due to overreliance on linguistic priors, scattered visual attention over irrelevant regions, and imbalanced cross-modal fusion. To address this, we propose a "gaze shift" guidance mechanism that leverages a precomputed holistic visual saliency map to dynamically enhance cross-modal attention to both salient image regions and the user query, mitigating the visual attention sink and promoting modality balance. The method comprises three components: (1) tracking of visual attention shifts during query comprehension, (2) saliency map construction from those shifts, and (3) step-wise attention amplification during autoregressive decoding via a low-overhead reweighting scheme. Evaluated on both generative and classification tasks, the approach significantly reduces hallucination rates, achieving up to a 20.7% improvement over greedy decoding, while preserving competitive performance across standard VLM benchmarks.

📝 Abstract
Vision-language models (VLMs) often generate hallucinations, i.e., content that cannot be substantiated by either textual or visual inputs. Prior work primarily attributes this to over-reliance on linguistic prior knowledge rather than visual inputs. Some methods attempt to mitigate hallucination by amplifying visual token attention proportionally to their attention scores. However, these methods overlook the visual attention sink problem, where attention is frequently misallocated to task-irrelevant visual regions, and neglect cross-modal fusion balance by enhancing only visual attention without adjusting attention to the user query. This can result in amplifying incorrect regions while failing to properly interpret the user query. To address these challenges, we propose a simple yet effective method called Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT). GIFT pre-computes a holistic visual saliency map by tracking positive changes in visual attention, or "gaze shifts", during user query comprehension, and leverages this map to amplify attention to both salient visual information and the user query at each decoding step. This reduces the impact of the visual attention sink, as irrelevant tokens exhibit minimal shifts, while ensuring balanced cross-modal fusion for a well-integrated representation. Extensive experiments show that GIFT effectively mitigates hallucination in VLMs across both generative and classification tasks, achieving up to a 20.7% improvement over greedy decoding, while maintaining general vision-language performance with low computational overhead.
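The two-stage procedure the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the array layout, the function names `gaze_shift_saliency` and `reweight_attention`, and the scaling factor `alpha` are all assumptions for exposition.

```python
import numpy as np

def gaze_shift_saliency(attn_over_query):
    """Build a holistic saliency map from 'gaze shifts'.

    attn_over_query: (num_query_steps, num_visual_tokens) array of
    attention over visual tokens at each user-query comprehension step
    (hypothetical layout). Saliency accumulates only *positive* changes
    between consecutive steps, so tokens that merely sit in an attention
    sink (high but flat attention) contribute little.
    """
    shifts = np.diff(attn_over_query, axis=0)          # step-to-step deltas
    saliency = np.clip(shifts, 0.0, None).sum(axis=0)  # keep positive shifts
    return saliency / (saliency.sum() + 1e-8)          # normalize to sum to 1

def reweight_attention(attn, visual_idx, query_idx, saliency, alpha=0.5):
    """Amplify attention at one decoding step, then renormalize.

    Salient visual tokens are boosted in proportion to their saliency,
    and user-query tokens are boosted uniformly, so the fusion stays
    balanced across modalities rather than inflating vision alone.
    """
    out = attn.copy()
    out[visual_idx] *= 1.0 + alpha * saliency  # saliency-weighted visual boost
    out[query_idx] *= 1.0 + alpha              # uniform query boost
    return out / out.sum()                     # restore a valid distribution
```

Note the design choice the abstract motivates: using positive attention *changes* rather than raw attention scores means a task-irrelevant sink token with persistently high but static attention receives near-zero saliency, so it is not amplified further.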
Problem

Research questions and friction points this paper is trying to address.

Mitigating visual attention sink in vision-language models
Balancing cross-modal fusion between visual inputs and user queries
Reducing hallucination in VLMs by tracking gaze shifts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gaze shift tracking for visual saliency mapping
Balanced cross-modal fusion of vision and language
Amplifying attention to salient regions and user query