Cross-Layer Vision Smoothing: Enhancing Visual Understanding via Sustained Focus on Key Objects in Large Vision-Language Models

📅 2025-09-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large Vision-Language Models (LVLMs) achieve accurate localization of salient objects in images but suffer from rapid attention decay across layers, hindering their ability to reason about object relationships and fine-grained attributes. Method: The paper proposes Cross-Layer Vision Smoothing (CLVS), the first approach to introduce an updateable vision memory that sustains focused attention on key objects across multiple Transformer layers. CLVS initializes the memory with position-unbiased visual attention in the first layer, jointly considers attention distributions and memory states layer by layer while updating the memory iteratively, and adaptively terminates smoothing via uncertainty estimation once visual understanding is complete. Contribution/Results: CLVS consistently improves relation reasoning and fine-grained attribute recognition across three mainstream LVLMs and four benchmarks, achieving state-of-the-art performance; extensive experiments demonstrate its effectiveness, robustness, and model-agnostic generalizability.

📝 Abstract
Large Vision-Language Models (LVLMs) can accurately locate key objects in images, yet their attention to these objects tends to be very brief. Motivated by the hypothesis that sustained focus on key objects can improve LVLMs' visual capabilities, we propose Cross-Layer Vision Smoothing (CLVS). The core idea of CLVS is to incorporate a vision memory that smooths the attention distribution across layers. Specifically, we initialize this vision memory with position-unbiased visual attention in the first layer. In subsequent layers, the model's visual attention jointly considers the vision memory from previous layers, while the memory is updated iteratively, thereby maintaining smooth attention on key objects. Given that visual understanding primarily occurs in the early and middle layers of the model, we use uncertainty as an indicator of completed visual understanding and terminate the smoothing process accordingly. Experiments on four benchmarks across three LVLMs confirm the effectiveness and generalizability of our method. CLVS achieves state-of-the-art performance on a variety of visual understanding tasks, with particularly significant improvements in relation and attribute understanding.
Problem

Research questions and friction points this paper is trying to address.

Sustaining attention on key objects in LVLMs
Improving visual understanding through cross-layer smoothing
Enhancing relation and attribute recognition in vision tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision memory smooths attention across layers
Initializes with position-unbiased visual attention
Uses uncertainty to terminate smoothing process
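The paper page gives no pseudocode, but the three innovation points above can be sketched concretely. The following is a minimal illustration, not the authors' implementation: it assumes exponential-moving-average blending with a hypothetical coefficient `alpha`, attention entropy as the uncertainty signal, and synthetic per-layer attention distributions over image tokens.

```python
import numpy as np

def cross_layer_vision_smoothing(attn_per_layer, alpha=0.5, entropy_threshold=1.0):
    """Illustrative sketch of cross-layer attention smoothing.

    attn_per_layer: list of 1-D arrays, each a normalized visual-attention
    distribution over image tokens at one Transformer layer.
    Returns the smoothed distributions, one per layer processed.
    """
    # Initialize the vision memory with the first layer's attention
    # (standing in for the paper's position-unbiased initialization).
    memory = attn_per_layer[0] / attn_per_layer[0].sum()
    smoothed = [memory]
    for attn in attn_per_layer[1:]:
        # Jointly consider the current layer's attention and the memory,
        # then renormalize so the result is still a distribution.
        blended = alpha * memory + (1 - alpha) * attn
        blended = blended / blended.sum()
        memory = blended                  # iterative memory update
        smoothed.append(blended)
        # Entropy as an uncertainty proxy: low entropy suggests visual
        # understanding has stabilized, so terminate smoothing early.
        entropy = -np.sum(blended * np.log(blended + 1e-12))
        if entropy < entropy_threshold:
            break
    return smoothed
```

Since the paper notes that visual understanding happens mainly in early and middle layers, the early-termination check keeps later layers' attention untouched once the uncertainty signal drops below threshold.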
👥 Authors
Jianfei Zhao
School of Computer Science and Technology, Beijing Institute of Technology; Zhongguancun Academy
Feng Zhang
School of Computer Science and Technology, Beijing Institute of Technology
Xin Sun
School of Computer Science and Technology, Beijing Institute of Technology
Lingxing Kong
Tsinghua University
Zhixing Tan
Tsinghua University
Artificial Intelligence · Natural Language Processing · AI Safety
Chong Feng
School of Computer Science and Technology, Beijing Institute of Technology; Southeast Academy of Information Technology, Beijing Institute of Technology