🤖 AI Summary
This work investigates whether cross-modal attention sinks in large vision-language models (LVLMs) are redundant artifacts or essential carriers of global scene priors, and mitigates the suppression of local visual perception that their dominance causes. We systematically distinguish, for the first time, between V-sinks originating from the Vision Transformer (ViT) and L-sinks emerging in deeper layers of the language model, revealing a fundamental trade-off between encoding global context and preserving fine-grained visual evidence. To reconcile this tension, we propose a lightweight, layer-wise Sink Gating (LSG) mechanism that dynamically modulates the contribution of attention sinks without requiring task-specific supervision or modifying the frozen backbone. Trained solely via standard next-token prediction and integrated as a modular plug-in, LSG consistently enhances performance across multiple multimodal benchmarks by effectively balancing global reasoning with local visual utilization.
📝 Abstract
Attention sinks are tokens that attract disproportionate attention. While they have been studied in single-modality transformers, their cross-modal impact in Large Vision-Language Models (LVLMs) remains largely unexplored: are they redundant artifacts or essential global priors? This paper first distinguishes two types of visual sinks: ViT-emerged sinks (V-sinks), which propagate from the vision encoder, and LLM-emerged sinks (L-sinks), which arise within deep LLM layers. Based on this distinction, our analysis reveals a fundamental performance trade-off: while sinks effectively encode global scene-level priors, their dominance can suppress the fine-grained visual evidence required for local perception. Furthermore, we identify specific functional layers where modulating these sinks most significantly impacts downstream performance. To leverage these insights, we propose Layer-wise Sink Gating (LSG), a lightweight, plug-and-play module that dynamically scales the attention contributions of V-sinks and the remaining visual tokens. LSG is trained via standard next-token prediction, requiring no task-specific supervision while keeping the LVLM backbone frozen. Across most layers, LSG yields improvements on representative multimodal benchmarks, effectively balancing global reasoning and precise local evidence.
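The gating idea described above can be sketched minimally: rescale the attention mass assigned to sink tokens versus the remaining visual tokens, then renormalize. This is an illustrative NumPy sketch, not the paper's implementation; the function name `sink_gate` and the scalar gates `g_sink`/`g_rest` are assumptions (the paper learns such gates layer-wise via next-token prediction with the backbone frozen).

```python
import numpy as np

def sink_gate(attn, sink_mask, g_sink, g_rest):
    """Rescale attention over visual tokens (hypothetical sketch).

    Sink-token weights are scaled by g_sink, all other visual-token
    weights by g_rest; the result is renormalized to sum to 1.
    """
    scale = np.where(sink_mask, g_sink, g_rest)
    gated = attn * scale
    return gated / gated.sum(axis=-1, keepdims=True)

# Toy example: one query attending to three visual tokens; token 0 is a sink.
attn = np.array([0.6, 0.2, 0.2])
sink_mask = np.array([True, False, False])
gated = sink_gate(attn, sink_mask, g_sink=0.5, g_rest=1.0)
```

With `g_sink < g_rest`, the sink's share of attention shrinks and the freed mass flows to the remaining visual tokens, which is the "balance global priors against local evidence" behavior the abstract describes; a gate learned per layer can instead amplify sinks where global context helps.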