The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning

📅 2025-09-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost and inference latency of Vision-Language-Action (VLA) models on resource-constrained platforms—caused by excessive visual tokens—this paper proposes LightVLA, the first end-to-end differentiable visual token pruning framework tailored for VLA tasks. The method introduces a performance-driven dynamic query mechanism to assess token importance and employs Gumbel softmax for adaptive, hyperparameter-free, zero-overhead token selection. Fully compatible with mainstream inference frameworks, LightVLA eliminates the need to specify a pruning ratio by hand. Evaluated on the LIBERO benchmark, LightVLA reduces computational FLOPs by 59.1% and inference latency by 38.2%, while improving task success rate by 2.9%. This work establishes, for the first time, the efficacy and practicality of adaptive visual token pruning in VLA modeling.

📝 Abstract
We present LightVLA, a simple yet effective differentiable token pruning framework for vision-language-action (VLA) models. While VLA models have shown impressive capability in executing real-world robotic tasks, their deployment on resource-constrained platforms is often bottlenecked by the heavy attention-based computation over large sets of visual tokens. LightVLA addresses this challenge through adaptive, performance-driven pruning of visual tokens: it generates dynamic queries to evaluate visual token importance, and adopts Gumbel softmax to enable differentiable token selection. Through fine-tuning, LightVLA learns to preserve the most informative visual tokens while pruning tokens that do not contribute to task execution, thereby improving efficiency and performance simultaneously. Notably, LightVLA requires no heuristic magic numbers and introduces no additional trainable parameters, making it compatible with modern inference frameworks. Experimental results demonstrate that LightVLA outperforms different VLA models and existing token pruning methods across diverse tasks on the LIBERO benchmark, achieving higher success rates with substantially reduced computational overhead. Specifically, LightVLA reduces FLOPs and latency by 59.1% and 38.2% respectively, with a 2.9% improvement in task success rate. Meanwhile, we also investigate LightVLA*, a learnable query-based token pruning variant with additional trainable parameters, which likewise achieves satisfactory performance. Our work reveals that as a VLA model pursues optimal performance, LightVLA spontaneously learns to prune tokens from a performance-driven perspective. To the best of our knowledge, LightVLA is the first work to apply adaptive visual token pruning to VLA tasks with the collateral goals of efficiency and performance, marking a significant step toward more efficient, powerful, and practical real-time robotic systems.
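The core selection step the abstract describes—scoring visual tokens against a query and picking tokens differentiably via Gumbel softmax—can be illustrated with a minimal, framework-free sketch. This is not the paper's implementation: the function name, the dot-product scoring, and the single-token selection are illustrative assumptions; in practice this runs on tensors with autograd and a straight-through estimator.

```python
import math
import random

def gumbel_softmax_select(scores, tau=1.0, seed=0):
    """Illustrative Gumbel-softmax token selection (forward pass only).

    scores : per-token importance logits, e.g. query-token similarities
             (the scoring rule here is a stand-in, not LightVLA's exact one).
    tau    : softmax temperature; lower values sharpen toward one-hot.
    Returns (soft_probs, hard_index). In training, the soft probabilities
    carry gradients while the hard selection is used in the forward pass
    (straight-through); this pure-Python sketch shows only the math.
    """
    rng = random.Random(seed)
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1);
    # clamp U away from 0 to avoid log(0).
    noisy = [s - math.log(-math.log(max(rng.random(), 1e-12)))
             for s in scores]
    # Numerically stable temperature-scaled softmax over perturbed logits.
    m = max(n / tau for n in noisy)
    exps = [math.exp(n / tau - m) for n in noisy]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Hard (discrete) choice: the token this query keeps.
    hard = max(range(len(probs)), key=probs.__getitem__)
    return probs, hard

# Four visual tokens with mock importance logits; the query keeps one.
probs, keep = gumbel_softmax_select([2.0, 0.1, -1.0, 0.5], tau=0.5)
```

With many queries, each making one such selection, the union of kept tokens forms the pruned visual token set; tokens never selected are dropped, which is how the pruning ratio emerges adaptively rather than being fixed by hand.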
Problem

Research questions and friction points this paper is trying to address.

Reduces computational overhead in vision-language-action models
Prunes uninformative visual tokens adaptively without heuristics
Improves task success rates while decreasing FLOPs and latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differentiable token pruning for vision-language-action models
Adaptive performance-driven pruning of visual tokens
Gumbel softmax for differentiable token selection
Titong Jiang
Tsinghua University
Xuefeng Jiang
Institute of Computing Technology, Chinese Academy of Sciences
Weakly-supervised Learning · Distributed Optimization · Autonomous Driving · Noisy Label Learning
Yuan Ma
LiAuto Inc.
Xin Wen
LiAuto Inc.
Bailin Li
LiAuto Inc.
Kun Zhan
LiAuto Inc.
Peng Jia
LiAuto Inc.
Yahui Liu
School of Vehicle and Mobility, Tsinghua University
Sheng Sun
Institute of Computing Technology, Chinese Academy of Sciences
Xianpeng Lang
LiAuto Inc.