AI Summary
This work addresses the degraded sensitivity to visual tokens in the deeper layers of existing vision-language-action (VLA) models, which the authors link to inaccurate action prediction in complex manipulation tasks. To mitigate this issue, they propose DeepVision-VLA, a framework that strengthens the visual grounding of action generation by injecting multi-level visual features into deeper network layers. The approach has two key components: a Vision-Language Mixture-of-Transformers (VL-MoT) architecture that enables shared attention between a vision foundation model and the VLA backbone, and an Action-Guided Visual Pruning (AGVP) mechanism that uses shallow-layer attention to keep task-relevant visual tokens and discard irrelevant ones. The reported results show the method outperforming prior state-of-the-art approaches by 9.0% on simulated tasks and 7.5% on real-world tasks.
Abstract
Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for robotic manipulation, in which reliable action prediction critically depends on accurately interpreting and integrating visual observations conditioned on language instructions. Although recent works have sought to enhance the visual capabilities of VLA models, most approaches treat the LLM backbone as a black box, providing limited insight into how visual information is grounded into action generation. To address this gap, we perform a systematic analysis of multiple VLA models across different action-generation paradigms and observe that sensitivity to visual tokens progressively decreases in deeper layers during action generation. Motivated by this observation, we propose DeepVision-VLA, built on a Vision-Language Mixture-of-Transformers (VL-MoT) framework. This framework enables shared attention between the vision foundation model and the VLA backbone, injecting multi-level visual features from the vision expert into deeper layers of the VLA backbone to enhance visual representations for precise and complex manipulation. In addition, we introduce Action-Guided Visual Pruning (AGVP), which leverages shallow-layer attention to prune irrelevant visual tokens while preserving task-relevant ones, reinforcing critical visual cues for manipulation with minimal computational overhead. DeepVision-VLA outperforms prior state-of-the-art methods by 9.0% and 7.5% on simulated and real-world tasks, respectively, providing new insights for the design of visually enhanced VLA models.
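The idea of attention-based token pruning behind AGVP can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `prune_visual_tokens`, the `keep_ratio` parameter, the averaging over action queries, and the top-k selection rule are all assumptions made for illustration. The abstract only states that shallow-layer attention is used to decide which visual tokens to keep.

```python
import numpy as np


def prune_visual_tokens(visual_tokens, attn, keep_ratio=0.5):
    """Keep the visual tokens that receive the most attention (hypothetical helper).

    visual_tokens: array of shape (N, D), one row per visual token.
    attn: array of shape (A, N), attention weights from A action queries
          to the N visual tokens, assumed to come from a shallow layer.
    keep_ratio: fraction of tokens to keep (an assumed hyperparameter).
    """
    # Average attention over the action queries: one relevance score per token.
    scores = attn.mean(axis=0)
    # Number of tokens to keep, at least one.
    k = max(1, int(round(keep_ratio * scores.shape[0])))
    # Indices of the k highest-scoring tokens, restored to their original order.
    keep = np.sort(np.argsort(scores)[-k:])
    return visual_tokens[keep], keep
```

In this sketch the kept tokens retain their original order, so any positional information attached to them stays consistent. How the pruned tokens are then fed to the vision expert or the deeper backbone layers is not specified in the abstract and is left out here.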