Large Vision-Language Models Get Lost in Attention

📅 2026-05-07
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses functional mismatch and redundancy in the attention mechanisms of current large vision-language models (LVLMs), which fail to exploit visual context efficiently. It establishes a unified framework grounded in information theory and information geometry that quantifies the geometric structure and entropy of residual updates, revealing a functional decoupling between attention and feed-forward networks (FFNs): attention acts as a subspace-preserving operator that reconfigures existing information, whereas FFNs act as subspace-expanding operators that introduce new semantic content. The work clarifies these distinct roles for the first time from an information-geometric perspective. It further shows that learned attention weights can be replaced by predefined values, such as Gaussian noise, with comparable or even better performance than the vanilla model on a majority of datasets. These results challenge the prevailing design paradigm built on dynamic attention and point to substantial redundancy in it.
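The distinction between a subspace-preserving and a subspace-expanding update can be illustrated with a toy measurement. The sketch below is not the paper's metric, which this page does not specify. It assumes a simple proxy: the fraction of an update's squared norm that lies in the span of the current hidden states. The function name `in_subspace_fraction`, the random weights, and the omission of value and output projections from the attention-like update are all illustrative assumptions.

```python
import numpy as np

def in_subspace_fraction(hidden, update):
    """Fraction of the update's squared norm lying in the row space of `hidden`.

    hidden: (n_tokens, d_model) hidden states before a module (hypothetical input).
    update: (n_tokens, d_model) residual update the module would add.
    """
    # Orthonormal basis of the span of the hidden states, taken from the SVD.
    _, s, vt = np.linalg.svd(hidden, full_matrices=False)
    rank = int(np.sum(s > 1e-10 * s.max()))
    basis = vt[:rank]                       # (rank, d_model)
    projected = update @ basis.T @ basis    # project every update row onto the span
    return float(np.sum(projected ** 2) / np.sum(update ** 2))

rng = np.random.default_rng(0)
n_tokens, d_model = 8, 64
hidden = rng.normal(size=(n_tokens, d_model))

# Toy attention-like update: each row is a convex mixture of existing hidden
# states (assumption: value and output projections are left out).
mix = rng.random(size=(n_tokens, n_tokens))
mix /= mix.sum(axis=1, keepdims=True)
mixing_update = mix @ hidden

# Toy FFN-like update: a token-wise ReLU MLP with random (untrained) weights.
w_in = rng.normal(size=(d_model, 4 * d_model)) / np.sqrt(d_model)
w_out = rng.normal(size=(4 * d_model, d_model)) / np.sqrt(4 * d_model)
ffn_update = np.maximum(hidden @ w_in, 0.0) @ w_out

preserving = in_subspace_fraction(hidden, mixing_update)
expanding = in_subspace_fraction(hidden, ffn_update)
print(f"mixing update inside span: {preserving:.3f}")
print(f"FFN-like update inside span: {expanding:.3f}")
```

In this toy setting the mixing update stays inside the span by construction, while the FFN-like update lands mostly outside it. A real attention layer also applies value and output projections, so the clean separation here should be read only as an intuition for the terminology, not as a reproduction of the reported finding.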
📝 Abstract
Despite the rapid evolution of training paradigms, the decoder backbone of large vision-language models (LVLMs) remains fundamentally rooted in the residual-connection Transformer architecture. Therefore, deciphering the distinct roles of internal modules is critical for understanding model mechanics and guiding architectural optimization. While prior statistical approaches have provided valuable attribution-based insights, they often lack a unified theoretical basis. To bridge this gap, we propose a unified framework grounded in information theory and geometry to quantify the geometric and entropic nature of residual updates. Applying this unified framework reveals a fundamental functional decoupling: Attention acts as a subspace-preserving operator focused on reconfiguration, whereas FFNs serve as subspace-expanding operators driving semantic innovation. Strikingly, further experiments demonstrate that replacing learned attention weights with predefined values (e.g., Gaussian noise) yields comparable or even superior performance across a majority of datasets relative to vanilla models. These results expose severe misallocation and redundancy in current mechanisms, suggesting that state-of-the-art LVLMs effectively "get lost in attention" rather than efficiently leveraging visual context.
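The replacement experiment described in the abstract can be pictured with a minimal single-head sketch. The abstract does not say how the predefined values are constructed or applied, so the details here are assumptions: the Gaussian noise is drawn once, masked causally, and normalized with a softmax, and the value projection is kept. The helper names `fixed_noise_weights` and `apply_fixed_attention` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(n):
    # True where a query position may attend to a key position.
    return np.tril(np.ones((n, n), dtype=bool))

def learned_attention(x, w_q, w_k, w_v):
    """Standard scaled dot-product attention: weights depend on the input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(causal_mask(len(x)), scores, -np.inf)
    return softmax(scores) @ v

def fixed_noise_weights(n, rng):
    """Input-independent attention weights built from Gaussian noise (assumed scheme)."""
    scores = rng.normal(size=(n, n))
    scores = np.where(causal_mask(n), scores, -np.inf)
    return softmax(scores)

def apply_fixed_attention(x, w_v, weights):
    """Mix value vectors with predefined weights instead of query-key scores."""
    return weights @ (x @ w_v)

rng = np.random.default_rng(0)
n_tokens, d_model = 6, 16
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(3))

out_learned = learned_attention(x, w_q, w_k, w_v)
weights = fixed_noise_weights(n_tokens, np.random.default_rng(1))
out_fixed = apply_fixed_attention(x, w_v, weights)
print(out_learned.shape, out_fixed.shape)
```

The only difference between the two paths is where the mixing weights come from: `learned_attention` computes them from queries and keys, whereas `apply_fixed_attention` reuses the same predefined matrix for any input of that length.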
Problem

Research questions and friction points this paper is trying to address.

Large Vision-Language Models
Attention Mechanism
Model Redundancy
Visual Context Utilization
Residual Connections
Innovation

Methods, ideas, or system contributions that make the work stand out.

information theory
geometric analysis
attention mechanism
functional decoupling
vision-language models
Gongli Xi
School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing, China
Ye Tian
School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing, China
Mengyu Yang
School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing, China
Huahui Yi
Nanyang Technological University, Singapore
Liang Lin
Fellow of IEEE/IAPR, Professor of Computer Science, Sun Yat-sen University
Embodied AI, Causal Inference and Learning, Multimodal Data Analysis
Xiaoshuai Hao
Beijing Academy of Artificial Intelligence (BAAI)
vision and language
Kun Wang
Singapore University of Technology and Design
Deep Learning, Computer Vision
Wendong Wang
China University of Petroleum (East China)
Flow in Porous Media, CCUS, Unconventional Resource Development