Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models

📅 2024-05-28
🏛️ arXiv.org
📈 Citations: 13
Influential: 2
🤖 AI Summary
LVLMs frequently hallucinate during image understanding due to attention misalignment: excessive reliance on image tokens irrelevant to the query ("blind tokens"). This work formally defines and systematically identifies this phenomenon and proposes AvisC, a test-time dynamic calibration framework that requires no fine-tuning and preserves the original model architecture. AvisC localizes blind tokens via layer-wise attention distribution analysis and corrects responses through a contrastive decoding strategy that reweights the output logits. Evaluated on major benchmarks, including POPE, MME, and AMBER, AvisC significantly reduces hallucination rates while improving accuracy and reliability on visual question answering and image captioning. As a lightweight, architecture-agnostic, plug-and-play intervention, AvisC advances trustworthy reasoning in LVLMs without compromising inference efficiency or model integrity.

📝 Abstract
Large Vision Language Models (LVLMs) demonstrate strong capabilities in visual understanding and description, yet often suffer from hallucinations, attributing incorrect or misleading features to images. We observe that LVLMs disproportionately focus on a small subset of image tokens, termed blind tokens, which are typically irrelevant to the query (e.g., background or non-object regions). We hypothesize that such attention misalignment plays a key role in generating hallucinated responses. To mitigate this issue, we propose Attentional Vision Calibration (AvisC), a test-time approach that dynamically recalibrates the influence of blind tokens without modifying the underlying attention mechanism. AvisC first identifies blind tokens by analyzing layer-wise attention distributions over image tokens, then employs a contrastive decoding strategy to balance the influence of original and blind-token-biased logits. Experiments on standard benchmarks, including POPE, MME, and AMBER, demonstrate that AvisC effectively reduces hallucinations in LVLMs.
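The first step described in the abstract, identifying blind tokens from layer-wise attention distributions, can be sketched as follows. This is a minimal illustration, not the paper's exact criterion: the aggregation across layers, the `mean + lam * std` threshold, and the parameter `lam` are all assumptions made here for concreteness.

```python
import numpy as np

def find_blind_tokens(attn, lam=1.0):
    """Flag image tokens that absorb a disproportionate share of attention.

    attn: (num_layers, num_image_tokens) array giving the attention mass
          each image token receives, averaged over heads and query positions.
    Returns a boolean mask over image tokens; True marks a "blind" token.
    The mean + lam * std cutoff is an illustrative choice, not AvisC's
    published selection rule.
    """
    per_token = attn.mean(axis=0)  # aggregate attention across layers
    threshold = per_token.mean() + lam * per_token.std()
    return per_token > threshold
```

In this sketch a token counts as blind when its layer-averaged attention mass is an outlier relative to the other image tokens, matching the observation that a small subset of tokens attracts most of the attention.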
Problem

Research questions and friction points this paper is trying to address.

LVLMs focus attention on image tokens irrelevant to the query, causing hallucinations
Attention misalignment leads to incorrect feature attribution in image descriptions
Hallucination mitigation is needed at test time, without retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic test-time recalibration of blind tokens' influence
Contrastive decoding balances original and blind-token-biased logits
Layer-wise attention analysis identifies blind tokens
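The contrastive decoding step above can be sketched as a simple logit combination. The `(1 + alpha) / alpha` weighting below follows the standard contrastive decoding formulation; AvisC's exact weighting and its schedule for `alpha` are not specified here and should be treated as assumptions.

```python
import numpy as np

def contrastive_logits(logits_orig, logits_blind, alpha=1.0):
    """Contrast the model's original next-token logits with the logits
    obtained when the model is biased toward the blind tokens.

    Directions that the blind-token-biased view also favors (candidate
    hallucinations) are suppressed; with alpha = 0 the original logits
    are returned unchanged.
    """
    return (1 + alpha) * logits_orig - alpha * logits_blind
```

A usage sketch: run one forward pass normally, one with attention steered toward the blind tokens, and decode from `contrastive_logits` of the two.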