AI Summary
Large Vision-Language Models (LVLMs) frequently suffer from visual hallucinations (VH), primarily due to three pathological attention behaviors: weak visual grounding, language-prior dominance, and prompt redundancy. To address this, we propose VisFlow, a zero-training, plug-and-play framework that performs dual-level attention intervention during inference. At the token level (Token-level Attention Intervention, TAI), it incorporates visual saliency guidance to strengthen attention toward critical visual tokens; at the head level (Head-level Attention Intervention, HAI), it suppresses excessive reliance on system prompts and autoregressive text tokens. This work is the first to systematically identify VH-related attention deficits in LVLMs and realize coordinated token- and head-level interventions. Extensive evaluation across multiple LVLMs and benchmarks demonstrates an average 32.7% reduction in hallucination rates and significant improvements in visual factuality, with negligible inference overhead.
Abstract
Large vision-language models (LVLMs) have shown remarkable capabilities across a wide range of multimodal tasks. However, they remain prone to visual hallucination (VH), often producing confident but incorrect descriptions of visual content. We present VisFlow, an efficient and training-free framework designed to mitigate VH by directly manipulating attention patterns during inference. Through systematic analysis, we identify three key pathological attention behaviors in LVLMs: (1) weak visual grounding, where attention to visual tokens is insufficient or misallocated, over-focusing on uninformative regions; (2) language-prior dominance, where excessive attention to prior response tokens reinforces autoregressive patterns and impairs multimodal alignment; and (3) prompt redundancy, where many attention heads fixate on system prompt tokens, disrupting the integration of image, instruction, and response content. To address these issues, we introduce two inference-time interventions: token-level attention intervention (TAI), which enhances focus on salient visual content, and head-level attention intervention (HAI), which suppresses over-attention to prompt and nearby text tokens. VisFlow operates without additional training or model modifications. Extensive experiments across models and benchmarks show that VisFlow effectively reduces hallucinations and improves visual factuality, with negligible computational cost.
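The two interventions above both amount to reweighting attention at inference time. As a minimal sketch (not the paper's exact formulation), the TAI/HAI idea can be illustrated as additive biases on pre-softmax attention logits: boost logits over visual-token positions and damp logits over system-prompt positions. The function name, the bias strengths `alpha` and `beta`, and the additive form are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def intervene_attention(scores, visual_idx, prompt_idx, alpha=1.0, beta=1.0):
    """Sketch of dual-level attention intervention for one query position.

    scores:     (num_heads, seq_len) pre-softmax attention logits.
    visual_idx: key positions holding image tokens.
    prompt_idx: key positions holding system-prompt tokens.
    alpha/beta: hypothetical bias strengths (not from the paper).
    """
    adjusted = scores.copy()
    # TAI analogue: strengthen attention toward visual tokens.
    adjusted[:, visual_idx] += alpha
    # HAI analogue: suppress over-attention to system-prompt tokens.
    adjusted[:, prompt_idx] -= beta
    return softmax(adjusted, axis=-1)
```

In a real LVLM this adjustment would be applied inside selected attention heads/layers at decoding time, leaving the model weights untouched, which is what makes the approach training-free.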