🤖 AI Summary
This work addresses functional misallocation and redundancy in the attention mechanisms of current large vision-language models (LVLMs), which fail to exploit visual context efficiently. It proposes a unified framework grounded in information theory and geometry that quantifies the geometric and entropic characteristics of residual updates, and uses it to reveal a functional decoupling: attention acts as a subspace-preserving operator focused on reconfiguration, whereas feed-forward networks (FFNs) act as subspace-expanding operators that drive semantic innovation. The study further shows that learned attention weights can be replaced by predefined values (e.g., Gaussian noise) while yielding comparable or even superior performance on a majority of the evaluated datasets relative to the vanilla models. These results challenge the prevailing design paradigm built on dynamic attention and point to substantial redundancy in it.
📝 Abstract
Despite the rapid evolution of training paradigms, the decoder backbone of large vision-language models (LVLMs) remains fundamentally rooted in the residual-connection Transformer architecture. Therefore, deciphering the distinct roles of internal modules is critical for understanding model mechanics and guiding architectural optimization. While prior statistical approaches have provided valuable attribution-based insights, they often lack a unified theoretical basis. To bridge this gap, we propose a unified framework grounded in information theory and geometry to quantify the geometric and entropic nature of residual updates. Applying this unified framework reveals a fundamental functional decoupling: Attention acts as a subspace-preserving operator focused on reconfiguration, whereas FFNs serve as subspace-expanding operators driving semantic innovation. Strikingly, further experiments demonstrate that replacing learned attention weights with predefined values (e.g., Gaussian noise) yields comparable or even superior performance across a majority of datasets relative to vanilla models. These results expose severe misallocation and redundancy in current mechanisms, suggesting that state-of-the-art LVLMs effectively "get lost in attention" rather than efficiently leveraging visual context.
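The subspace-preserving versus subspace-expanding distinction can be made concrete with a toy NumPy sketch. It is an illustration under stated assumptions and not the paper's actual metric or experimental setup. The function name `in_span_fraction`, the random weights, the identity value/output projections, and the dimensions are all hypothetical choices made here. The sketch measures how much of a residual update's energy lies in the subspace already spanned by the token hidden states. The attention-like update uses softmaxed Gaussian noise as its mixing weights.

```python
import numpy as np

def in_span_fraction(hidden, update):
    """Fraction of the update's squared norm that lies in the row space of `hidden`."""
    # Orthonormal basis of the row space: right singular vectors with non-negligible singular values.
    _, s, vt = np.linalg.svd(hidden, full_matrices=False)
    basis = vt[s > 1e-10 * s.max()]            # shape (rank, d_model)
    projected = update @ basis.T @ basis       # project every update row onto the span
    return float(np.sum(projected**2) / np.sum(update**2))

rng = np.random.default_rng(0)
n_tokens, d_model = 8, 64                      # fewer tokens than dimensions, so the span is a proper subspace
hidden = rng.standard_normal((n_tokens, d_model))

# Attention-like update: mixing weights are softmaxed Gaussian noise (no learned queries or keys).
# Assumption: identity value and output projections, so each output row is a
# convex combination of existing hidden states.
logits = rng.standard_normal((n_tokens, n_tokens))
weights = np.exp(logits - logits.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
attn_update = weights @ hidden

# FFN-like update: a random two-layer ReLU MLP applied to each token independently.
w1 = rng.standard_normal((d_model, 4 * d_model)) / np.sqrt(d_model)
w2 = rng.standard_normal((4 * d_model, d_model)) / np.sqrt(4 * d_model)
ffn_update = np.maximum(hidden @ w1, 0.0) @ w2

attn_frac = in_span_fraction(hidden, attn_update)
ffn_frac = in_span_fraction(hidden, ffn_update)
print(f"attention-like update in-span fraction: {attn_frac:.3f}")
print(f"FFN-like update in-span fraction:       {ffn_frac:.3f}")
```

In this toy setting the attention-like update stays inside the existing span by construction, while the FFN-like update places most of its energy outside it. In a real model, learned value and output projections can move the attention output out of the span, so whether attention is subspace-preserving there is an empirical question of the kind the proposed framework is meant to quantify.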