When Sinks Help or Hurt: Unified Framework for Attention Sink in Large Vision-Language Models

📅 2026-04-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates whether cross-modal attention sinks in large vision-language models (LVLMs) are redundant artifacts or essential carriers of global scene priors, and addresses their detrimental suppression of local visual perception. We systematically distinguish, for the first time, between V-sinks originating from the Vision Transformer (ViT) and L-sinks emerging in deeper layers of the language model, revealing a fundamental trade-off between encoding global context and preserving fine-grained visual evidence. To reconcile this tension, we propose a lightweight, layer-wise Sink Gating (LSG) mechanism that dynamically modulates the contribution of attention sinks without requiring task-specific supervision or modifying the frozen backbone. Trained solely via standard next-token prediction and integrated as a modular plug-in, LSG consistently enhances performance across multiple multimodal benchmarks by effectively balancing global reasoning with local visual utilization.
📝 Abstract
Attention sinks are defined as tokens that attract disproportionate attention. While they have been studied in single-modality transformers, their cross-modal impact in Large Vision-Language Models (LVLMs) remains largely unexplored: are they redundant artifacts or essential global priors? This paper first separates visual sinks into two distinct types: ViT-emerged sinks (V-sinks), which propagate from the vision encoder, and LLM-emerged sinks (L-sinks), which arise within deep LLM layers. Based on this categorization, our analysis reveals a fundamental performance trade-off: while sinks effectively encode global scene-level priors, their dominance can suppress the fine-grained visual evidence required for local perception. Furthermore, we identify specific functional layers where modulating these sinks most significantly impacts downstream performance. To leverage these insights, we propose Layer-wise Sink Gating (LSG), a lightweight, plug-and-play module that dynamically scales the attention contributions of V-sinks and the remaining visual tokens. LSG is trained via standard next-token prediction, requiring no task-specific supervision while keeping the LVLM backbone frozen. Applied at most layers, LSG yields improvements on representative multimodal benchmarks, effectively balancing global reasoning and precise local evidence.
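The abstract does not spell out how the gating is parameterized, so the following is only a minimal sketch of what a layer-wise sink-gating module could look like. The class name `LayerwiseSinkGating`, the two-gate-per-layer parameterization, the sigmoid initialization, and the choice to rescale post-softmax attention from text queries are all assumptions for illustration, not the paper's actual implementation; only the small gate parameters would be trained via next-token prediction while the LVLM backbone stays frozen.

```python
import torch
import torch.nn as nn


class LayerwiseSinkGating(nn.Module):
    """Hypothetical sketch of layer-wise sink gating (not the paper's code).

    One learnable gate pair per decoder layer rescales the attention mass
    placed on (a) V-sink visual tokens and (b) the remaining visual tokens,
    then renormalizes each attention row so it still sums to one.
    """

    def __init__(self, num_layers: int):
        super().__init__()
        # One (sink, non-sink) gate pair per layer; zeros -> sigmoid(0)*2 = 1,
        # i.e. the module starts as an identity and is learned from there.
        self.gates = nn.Parameter(torch.zeros(num_layers, 2))

    def forward(
        self,
        attn: torch.Tensor,        # (batch, heads, q_len, k_len), post-softmax
        layer_idx: int,            # which decoder layer this call belongs to
        sink_mask: torch.Tensor,   # (k_len,) bool, True at V-sink positions
        visual_mask: torch.Tensor, # (k_len,) bool, True at all visual tokens
    ) -> torch.Tensor:
        # Gates constrained to (0, 2): values below 1 suppress, above 1 amplify.
        g_sink, g_rest = torch.sigmoid(self.gates[layer_idx]) * 2.0

        scale = torch.ones_like(attn)
        scale[..., sink_mask] = g_sink
        scale[..., visual_mask & ~sink_mask] = g_rest

        attn = attn * scale
        # Renormalize so every query still distributes a total mass of 1.
        return attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)
```

Keeping the gates scalar and per-layer is what makes such a module lightweight and plug-and-play: it adds only `2 × num_layers` parameters and can be hooked into each attention layer of a frozen backbone without modifying its weights.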
Problem

Research questions and friction points this paper is trying to address.

attention sinks
vision-language models
cross-modal attention
global priors
local perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

attention sink
vision-language models
layer-wise gating
global-local trade-off
plug-and-play module
Jiho Choi
KAIST, South Korea
Jaemin Kim
Chung-Ang University, South Korea
Sanghwan Kim
Technical University of Munich, Helmholtz Munich
Seunghoon Hong
KAIST, South Korea
Jin-Hwi Park
Chung-Ang University, South Korea