VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference

📅 2025-11-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Vision-Language-Action (VLA) models incur prohibitive computational overhead when processing continuous visual streams, hindering real-time deployment. Conventional semantic-saliency-based visual token pruning overlooks the dual-system nature of VLA models, which integrate high-level *semantic understanding* with low-level *action execution*, and therefore discards action-critical information and degrades performance. To address this, we propose the first VLA-aware hierarchical token pruning framework: (1) a *semantic-level* module leverages vision-language prefill attention to assess high-level semantic importance; (2) an *action-level* module introduces temporally smoothed action-decoding attention to model low-level action relevance; and (3) an adaptive dual-level pruning strategy jointly allocates the token budget across both tiers. Evaluated across diverse VLA architectures and robotic tasks, our method achieves state-of-the-art performance while reducing FLOPs by an average of 42%, with action-generation accuracy maintained or improved.

📝 Abstract
Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet the heavy computational cost of processing continuous visual streams severely limits their real-time deployment. Token pruning (keeping salient visual tokens and dropping redundant ones) has emerged as an effective approach for accelerating Vision-Language Models (VLMs), offering a path to efficient VLA inference. However, these VLM-specific token pruning methods select tokens based solely on semantic-salience metrics (e.g., prefill attention), overlooking the intrinsic dual-system nature of VLA models: high-level semantic understanding and low-level action execution. Consequently, they bias token retention toward semantic cues, discard information critical for action generation, and significantly degrade VLA performance. To bridge this gap, we propose VLA-Pruner, a versatile plug-and-play VLA-specific token pruning method that aligns with the dual-system nature of VLA models and exploits the temporal continuity of robot manipulation. Specifically, VLA-Pruner adopts a dual-level importance criterion for visual token retention: vision-language prefill attention for semantic-level relevance, and action-decode attention, estimated via temporal smoothing, for action-level importance. Based on this criterion, VLA-Pruner applies a novel dual-level token selection strategy that adaptively preserves a compact, informative set of visual tokens for both semantic understanding and action execution under a given compute budget. Experiments show that VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.
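
To make the selection concrete, here is a minimal sketch of a dual-level, budget-constrained token selection, assuming each visual token already carries two scalar scores: prefill attention mass (semantic tier) and temporally smoothed action-decode attention mass (action tier). The function name `dual_level_prune`, the fixed split `alpha`, and the `keep_ratio` knob are illustrative assumptions, not the paper's API; the paper describes an adaptive allocation between the two tiers.

```python
import torch

def dual_level_prune(prefill_attn: torch.Tensor,
                     action_attn_ema: torch.Tensor,
                     keep_ratio: float = 0.5,
                     alpha: float = 0.5) -> torch.Tensor:
    """Keep a budgeted subset of visual tokens using two importance tiers.

    prefill_attn:    (N,) attention mass that prompt/text tokens place on
                     each visual token during prefill (semantic relevance).
    action_attn_ema: (N,) temporally smoothed attention mass from action
                     decoding (action-level relevance).
    keep_ratio:      fraction of the N visual tokens to retain (budget).
    alpha:           share of the budget given to the semantic tier
                     (illustrative fixed split; the paper adapts it).
    Returns the sorted indices of the retained tokens.
    """
    n = prefill_attn.numel()
    budget = max(1, int(n * keep_ratio))
    k_sem = max(1, int(budget * alpha))
    k_act = budget - k_sem

    # Tier 1: tokens most important for high-level semantic understanding.
    sem_idx = torch.topk(prefill_attn, k_sem).indices

    # Tier 2: among the remaining tokens, those most attended during
    # action decoding; masking avoids double-counting tier-1 picks.
    act_scores = action_attn_ema.clone()
    act_scores[sem_idx] = float("-inf")
    act_idx = torch.topk(act_scores, k_act).indices

    return torch.cat([sem_idx, act_idx]).sort().values
```

The fixed two-tier quota is one simple instantiation; replacing the constant `alpha` with a data-dependent split would be closer to the adaptive strategy the abstract describes.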
Problem

Research questions and friction points this paper is trying to address.

Addresses inefficient visual token pruning in Vision-Language-Action models
Overcomes the bias toward semantic cues that neglects action-critical information
Proposes dual-level pruning for semantic understanding and action execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-level importance criterion for token retention
Temporal smoothing estimates action-decode attention (see the sketch after this list)
Adaptive token selection for semantic and action needs
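
Below is a hedged sketch of the temporal-smoothing step, assuming an exponential moving average over per-token decode attention across consecutive control steps; the class `ActionAttentionEMA` and the `momentum` hyperparameter are assumptions for illustration, not the paper's API.

```python
import torch

class ActionAttentionEMA:
    """Exponential-moving-average estimate of action-decode attention.

    A sketch under the assumption that "temporal smoothing" is an EMA
    over consecutive control steps; `momentum` is an assumed knob.
    """

    def __init__(self, momentum: float = 0.9):
        self.momentum = momentum
        self.state = None  # (N,) running per-visual-token estimate

    def update(self, decode_attn: torch.Tensor) -> torch.Tensor:
        # decode_attn: (N,) attention mass from action tokens to visual
        # tokens at the current step, averaged over heads/layers.
        decode_attn = decode_attn.detach()
        if self.state is None or self.state.shape != decode_attn.shape:
            self.state = decode_attn.clone()  # (re)initialize on first step
        else:
            self.state.mul_(self.momentum).add_(decode_attn,
                                                alpha=1 - self.momentum)
        return self.state
```

At each control step, the returned estimate could serve as the `action_attn_ema` input to the selection sketch shown earlier, exploiting the temporal continuity of manipulation noted in the abstract.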
🔎 Similar Papers
No similar papers found.
Ziyan Liu
School of AI, Shanghai Jiao Tong University
Yeqiu Chen
School of AI, Shanghai Jiao Tong University
Hongyi Cai
University of Malaya
Data-centric AI · AI for Efficiency · Computer Vision
Tao Lin
School of AI, Shanghai Jiao Tong University
Shuo Yang
Harbin Institute of Technology (Shenzhen)
Zheng Liu
BAAI
Bo Zhao
School of AI, Shanghai Jiao Tong University