🤖 AI Summary
This work investigates whether cross-modal attention sinks in large vision-language models (LVLMs) are redundant artifacts or essential carriers of global scene priors, and mitigates the suppression of local visual perception that their dominance causes. We systematically distinguish, for the first time, between V-sinks originating from the Vision Transformer (ViT) and L-sinks emerging in deeper layers of the language model, revealing a fundamental trade-off between encoding global context and preserving fine-grained visual evidence. To reconcile this tension, we propose a lightweight, layer-wise Sink Gating (LSG) mechanism that dynamically modulates the contribution of attention sinks without requiring task-specific supervision or modifying the frozen backbone. Trained solely via standard next-token prediction and integrated as a modular plug-in, LSG consistently enhances performance across multiple multimodal benchmarks by effectively balancing global reasoning with local visual utilization.
📝 Abstract
Attention sinks are tokens that attract disproportionate attention. While they have been studied in single-modality transformers, their cross-modal impact in Large Vision-Language Models (LVLMs) remains largely unexplored: are they redundant artifacts or essential global priors? This paper first distinguishes two types of visual sinks: ViT-emerged sinks (V-sinks), which propagate from the vision encoder, and LLM-emerged sinks (L-sinks), which arise within deep LLM layers. Based on this distinction, our analysis reveals a fundamental performance trade-off: while sinks effectively encode global scene-level priors, their dominance can suppress the fine-grained visual evidence required for local perception. Furthermore, we identify specific functional layers where modulating these sinks most significantly impacts downstream performance. To leverage these insights, we propose Layer-wise Sink Gating (LSG), a lightweight, plug-and-play module that dynamically scales the attention contributions of V-sinks and the remaining visual tokens. LSG is trained via standard next-token prediction, requiring no task-specific supervision while keeping the LVLM backbone frozen. Across most layers, LSG yields improvements on representative multimodal benchmarks, effectively balancing global reasoning and precise local evidence.
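The gating idea described above can be sketched minimally: rescale the attention mass assigned to sink tokens versus the remaining visual tokens, then renormalize. This is an illustrative NumPy sketch, not the paper's implementation; the function name `sink_gate` and the scalar gates `g_sink`/`g_rest` are assumptions (the paper learns such gates layer-wise via next-token prediction with the backbone frozen).

```python
import numpy as np

def sink_gate(attn, sink_mask, g_sink, g_rest):
    """Rescale attention over visual tokens (hypothetical sketch).

    Sink-token weights are scaled by g_sink, all other visual-token
    weights by g_rest; the result is renormalized to sum to 1.
    """
    scale = np.where(sink_mask, g_sink, g_rest)
    gated = attn * scale
    return gated / gated.sum(axis=-1, keepdims=True)

# Toy example: one query attending to three visual tokens; token 0 is a sink.
attn = np.array([0.6, 0.2, 0.2])
sink_mask = np.array([True, False, False])
gated = sink_gate(attn, sink_mask, g_sink=0.5, g_rest=1.0)
```

With `g_sink < g_rest`, the sink's share of attention shrinks and the freed mass flows to the remaining visual tokens, which is the "balance global priors against local evidence" behavior the abstract describes; a gate learned per layer can instead amplify sinks where global context helps.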