AI Summary
Large Vision-Language Models (LVLMs) frequently suffer from visual hallucinations (VH), primarily due to three pathological attention behaviors: weak visual grounding, language-prior dominance, and prompt redundancy. To address this, we propose VisFlow, a zero-training, plug-and-play framework that performs dual-level attention intervention during inference. At the token level (Token-level Attention Intervention, TAI), it incorporates visual saliency guidance to strengthen attention toward critical visual tokens; at the head level (Head-level Attention Intervention, HAI), it suppresses excessive reliance on system prompts and autoregressive text tokens. This work is the first to systematically identify VH-related attention deficits in LVLMs and realize coordinated token- and head-level interventions. Extensive evaluation across multiple LVLMs and benchmarks demonstrates an average 32.7% reduction in hallucination rates and significant improvements in visual factuality, with negligible inference overhead.
Abstract
Large vision-language models (LVLMs) have shown remarkable capabilities across a wide range of multimodal tasks. However, they remain prone to visual hallucination (VH), often producing confident but incorrect descriptions of visual content. We present VisFlow, an efficient and training-free framework designed to mitigate VH by directly manipulating attention patterns during inference. Through systematic analysis, we identify three key pathological attention behaviors in LVLMs: (1) weak visual grounding, where attention to visual tokens is insufficient or misallocated, over-focusing on uninformative regions; (2) language-prior dominance, where excessive attention to prior response tokens reinforces autoregressive patterns and impairs multimodal alignment; and (3) prompt redundancy, where many attention heads fixate on system prompt tokens, disrupting the integration of image, instruction, and response content. To address these issues, we introduce two inference-time interventions: token-level attention intervention (TAI), which enhances focus on salient visual content, and head-level attention intervention (HAI), which suppresses over-attention to prompt and nearby text tokens. VisFlow operates without additional training or model modifications. Extensive experiments across models and benchmarks show that VisFlow effectively reduces hallucinations and improves visual factuality, with negligible computational cost.
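The two interventions above both amount to reweighting attention at inference time. As a minimal sketch (not the paper's exact formulation), the TAI/HAI idea can be illustrated as additive biases on pre-softmax attention logits: boost logits over visual-token positions and damp logits over system-prompt positions. The function name, the bias strengths `alpha` and `beta`, and the additive form are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def intervene_attention(scores, visual_idx, prompt_idx, alpha=1.0, beta=1.0):
    """Sketch of dual-level attention intervention for one query position.

    scores:     (num_heads, seq_len) pre-softmax attention logits.
    visual_idx: key positions holding image tokens.
    prompt_idx: key positions holding system-prompt tokens.
    alpha/beta: hypothetical bias strengths (not from the paper).
    """
    adjusted = scores.copy()
    # TAI analogue: strengthen attention toward visual tokens.
    adjusted[:, visual_idx] += alpha
    # HAI analogue: suppress over-attention to system-prompt tokens.
    adjusted[:, prompt_idx] -= beta
    return softmax(adjusted, axis=-1)
```

In a real LVLM this adjustment would be applied inside selected attention heads/layers at decoding time, leaving the model weights untouched, which is what makes the approach training-free.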