Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models

πŸ“… 2026-03-16
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the degraded sensitivity to visual tokens in deep layers of existing vision-language-action (VLA) models, which leads to inaccurate action prediction in complex manipulation tasks. To mitigate this issue, the authors propose DeepVision-VLA, a framework that enhances the visual grounding of action generation by injecting multi-level visual features into deeper network layers. The approach introduces two key innovations: a Vision-Language Mixture-of-Transformers (VL-MoT) architecture that enables shared attention between vision and language backbones, and an Action-Guided Visual Pruning (AGVP) mechanism that dynamically preserves task-relevant visual tokens based on shallow-layer attention. Experimental results demonstrate that the proposed method outperforms state-of-the-art approaches by 9.0% in simulation and 7.5% in real-world tasks, significantly improving both accuracy and robustness in complex robotic manipulation.

πŸ“ Abstract
Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for robotic manipulation, in which reliable action prediction critically depends on accurately interpreting and integrating visual observations conditioned on language instructions. Although recent works have sought to enhance the visual capabilities of VLA models, most approaches treat the LLM backbone as a black box, providing limited insight into how visual information is grounded into action generation. Therefore, we perform a systematic analysis of multiple VLA models across different action-generation paradigms and observe that sensitivity to visual tokens progressively decreases in deeper layers during action generation. Motivated by this observation, we propose **DeepVision-VLA**, built on a **Vision-Language Mixture-of-Transformers (VL-MoT)** framework. This framework enables shared attention between the vision foundation model and the VLA backbone, injecting multi-level visual features from the vision expert into deeper layers of the VLA backbone to enhance visual representations for precise and complex manipulation. In addition, we introduce **Action-Guided Visual Pruning (AGVP)**, which leverages shallow-layer attention to prune irrelevant visual tokens while preserving task-relevant ones, reinforcing critical visual cues for manipulation with minimal computational overhead. DeepVision-VLA outperforms prior state-of-the-art methods by 9.0% and 7.5% on simulated and real-world tasks, respectively, providing new insights for the design of visually enhanced VLA models.
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models
visual representation
action generation
visual grounding
robotic manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action Models
Mixture-of-Transformers
Visual Token Pruning
Action-Guided Attention
Deep Vision Integration
πŸ”Ž Similar Papers
Yulin Luo
Peking University
Data-centric AI · LLM · VLM · Embodied AI
Hao Chen
The Chinese University of Hong Kong
Zhuangzhe Wu
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Bowen Sui
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Jiaming Liu
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Chenyang Gu
Undergraduate, Peking University
Embodied AI · Robotic Manipulation
Zhuoyang Liu
Peking University
Embodied AI · Computer Vision
Qiuxuan Feng
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Jiale Yu
δΈ­ε›½η§‘ε­¦ζŠ€ζœ―ε€§ε­¦
Shuo Gu
Simplexity Robotics
Peng Jia
Simplexity Robotics
Pheng-Ann Heng
The Chinese University of Hong Kong
Shanghang Zhang
Peking University
Embodied AI · Foundation Models