IKOD: Mitigating Visual Attention Degradation in Large Vision-Language Models

📅 2025-08-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from visual attention decay during long-sequence generation, exacerbating image-irrelevant hallucinations—a phenomenon whose root cause has remained unclear. This work is the first to systematically establish visual attention decay as a key driver of long-range hallucinations. We propose IKOD, a training-free, lightweight, and general-purpose collaborative decoding framework. Its core is a key-value-merging-based attention guidance mechanism that dynamically fuses high visual-fidelity logits from short-sequence decoding with original autoregressive outputs, enabling sustained injection of image information throughout generation. Evaluated across multiple hallucination benchmarks, IKOD significantly reduces hallucination rates (average reduction of 21.3%) while improving overall generation quality. It is compatible with mainstream LVLMs, incurs zero additional parameters, and enables efficient deployment.

Technology Category

Application Category

📝 Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated significant progress across multiple domains. However, these models still face the inherent challenge of integrating vision and language for collaborative inference, which often leads to "hallucinations", outputs that are not grounded in the corresponding images. Many efforts have been made to address these issues, but each comes with its own limitations, such as high computational cost or expensive dataset annotation. Recent research shows that LVLMs exhibit a long-term bias where hallucinations increase as the sequence length grows, yet the underlying cause remains poorly understood. Building on extensive research into attention mechanisms in LVLMs, we analyze the relationship between this long-term bias and visual attention. In our research, we identify a consistent phenomenon in current LVLMs: the model's attention to visual input diminishes as the generated sequence grows, which we hypothesize to be a key factor contributing to observed increasing hallucinations. Based on these insights, we propose Image attention-guided Key-value merging cOllaborative Decoding (IKOD), a collaborative decoding strategy generating more image-focused sequences. This method derives logits from shorter sequences with higher image attention through key-value merging and combines them with those from the original decoding, effectively mitigating attention degradation and suppressing hallucinations while not incurring too much inference cost. Extensive experiments on both hallucination and comprehensive benchmarks demonstrate IKOD's superior effectiveness in mitigating hallucinations and improving comprehensive capacities for LVLMs. Importantly, IKOD requires no additional training or external tools, making it a lightweight and efficient framework applicable to various models.
Problem

Research questions and friction points this paper is trying to address.

Mitigating visual attention degradation in LVLMs
Reducing hallucinations in long sequence generation
Improving image focus without high computational cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Key-value merging for attention enhancement
Collaborative decoding to reduce hallucinations
Lightweight framework without extra training
🔎 Similar Papers
No similar papers found.
J
Jiabing Yang
School of Artificial Intelligence, University of Chinese Academy of Sciences; NLPR, Institute of Automation, Chinese Academy of Sciences
Chenhang Cui
Chenhang Cui
National University of Singapore
AI AlignmentFoundation ModelsAI safety
Yiyang Zhou
Yiyang Zhou
Ph.D. Student, UNC Chapel Hill CS
Natural Language ProcessingMultimodalMachine learning
Y
Yixiang Chen
School of Artificial Intelligence, University of Chinese Academy of Sciences; NLPR, Institute of Automation, Chinese Academy of Sciences
Peng Xia
Peng Xia
PhD student, Department of Computer Science, UNC Chapel Hill
Multimodal AgentHealthcare
Ying Wei
Ying Wei
Zhejiang University
Machine LearningTransfer LearningContinual LearningAI for Science
T
Tao Yu
School of Artificial Intelligence, University of Chinese Academy of Sciences; NLPR, Institute of Automation, Chinese Academy of Sciences
Y
Yan Huang
School of Artificial Intelligence, University of Chinese Academy of Sciences; NLPR, Institute of Automation, Chinese Academy of Sciences
L
Liang Wang
School of Artificial Intelligence, University of Chinese Academy of Sciences; NLPR, Institute of Automation, Chinese Academy of Sciences