2D or 3D: Who Governs Salience in VLA Models? -- Tri-Stage Token Pruning Framework with Modality Salience Awareness

📅 2026-04-10
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the excessive token count that vision-language-action (VLA) models incur when they incorporate 3D visual inputs, which significantly hinders inference efficiency. Existing token pruning methods do not account for the dynamic salience differences between 2D and 3D modalities, leading to suboptimal trade-offs between computational efficiency and task accuracy. To overcome this limitation, the study systematically uncovers, for the first time, the distinct and varying salience patterns of 2D and 3D tokens during model inference. Building on this insight, the authors propose a three-stage token pruning framework that dynamically selects critical tokens throughout the inference pipeline. The method achieves up to a 2.55× inference speedup with only 5.8% overhead and minimal accuracy degradation.

📝 Abstract
Vision-Language-Action (VLA) models have emerged as the mainstream approach to embodied intelligence. Recent VLA models have expanded their input modalities from 2D-only to 2D+3D paradigms, forming multi-visual-modal VLA (MVLA) models. Despite achieving improved spatial perception, MVLA models face a greater demand for acceleration because modal expansion increases the number of input tokens. Token pruning is an effective optimization method that is well suited to MVLA models. However, existing token pruning schemes are designed for 2D-only VLA models and ignore 2D/3D modality salience differences. In this paper, we follow the application process of multi-modal data in MVLA models and develop a tri-stage analysis to capture the discrepancy and dynamics of 2D/3D modality salience. Based on this analysis, we propose a corresponding tri-stage token pruning framework for MVLA models to achieve optimal 2D/3D token selection and efficient pruning. Experiments show that our framework achieves up to a 2.55x inference speedup with minimal accuracy loss, at a cost of only 5.8% overhead. Our code is coming soon.
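
To make the idea of modality-aware token pruning concrete, here is a minimal sketch in Python. It is not the paper's tri-stage framework, whose details are not given on this page. It only illustrates the general pattern of pooling 2D and 3D tokens under one budget and keeping the highest-scoring ones. The function name `prune_tokens`, the per-token salience scores, and the `weight_2d`/`weight_3d` modality weights are all hypothetical and introduced purely for illustration.

```python
import math
from typing import List, Sequence, Tuple


def prune_tokens(
    scores_2d: Sequence[float],
    scores_3d: Sequence[float],
    keep_ratio: float,
    weight_2d: float = 1.0,
    weight_3d: float = 1.0,
) -> Tuple[List[int], List[int]]:
    """Keep the top-scoring tokens across both modalities.

    Hypothetical illustration only: scores and weights are assumed inputs,
    not quantities defined by the paper.
    Returns the sorted indices of the kept 2D tokens and kept 3D tokens.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Pool both modalities, scaling each score by its modality weight.
    pooled = [(weight_2d * s, "2d", i) for i, s in enumerate(scores_2d)]
    pooled += [(weight_3d * s, "3d", i) for i, s in enumerate(scores_3d)]
    if not pooled:
        return [], []

    # Shared token budget across 2D and 3D tokens.
    budget = max(1, math.ceil(keep_ratio * len(pooled)))

    # Stable sort by weighted score, highest first.
    pooled.sort(key=lambda item: item[0], reverse=True)
    kept = pooled[:budget]

    keep_2d = sorted(i for _, modality, i in kept if modality == "2d")
    keep_3d = sorted(i for _, modality, i in kept if modality == "3d")
    return keep_2d, keep_3d


if __name__ == "__main__":
    s2d = [0.9, 0.1, 0.4, 0.2]
    s3d = [0.8, 0.3, 0.05, 0.7]
    # Equal weights: the budget is split by raw score.
    print(prune_tokens(s2d, s3d, keep_ratio=0.5))
    # Up-weighting 3D shifts the same budget toward 3D tokens.
    print(prune_tokens(s2d, s3d, keep_ratio=0.5, weight_3d=2.0))
```

Changing the modality weights shifts how a fixed budget is divided between 2D and 3D tokens, which is the kind of trade-off a salience-aware scheme would need to manage. How salience is actually measured, and how it changes across stages, is left to the paper.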
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
token pruning
modality salience
2D+3D
embodied intelligence
Innovation

Methods, ideas, or system contributions that make the work stand out.

token pruning
modality salience
Vision-Language-Action (VLA)
multi-modal fusion
inference acceleration
Authors

Zihao Zheng, Peking University (Machine Learning System, Edge Computing, Computer Architecture, EDA)
Sicheng Tian, School of Artificial Intelligence, Beijing Normal University
Zhihao Mao, School of Computer Science, China University of Geosciences (Wuhan)
Lingyue Zhang, School of Electronics Engineering and Computer Science, Peking University
Chenyue Li, Hong Kong University of Science and Technology (AI for Science, Large Language Model)
Ziyun Zhang, School of Computer Science, Peking University
Hong Gao, Zhejiang Normal University (Database, Internet of Things)
Yuchen Huang, University of Michigan - Ann Arbor (AI Interpretability, Machine Learning, Neural Systems, Ubiquitous Computing)
Yutong Xu, ZTE Corporation
Guojie Luo, Peking University (Electronic Design Automation, Reconfigurable Architecture)
Xiang Chen, School of Computer Science, Peking University