Learning to Accelerate Vision-Language-Action Models through Adaptive Visual Token Caching

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost and the lack of task-aware dynamic acceleration in existing Vision-Language-Action (VLA) models, both of which hinder real-time deployment. The authors propose a learnable inference-acceleration strategy that introduces, for the first time, an adaptive, task-driven visual token caching mechanism into VLA architectures. It is realized by two lightweight modules, a Cached Token Selector and a Cache Ratio Predictor, which jointly make dynamic caching decisions at inference time. The entire system is trained end-to-end using differentiable relaxation techniques. Experiments on the LIBERO and SIMPLER benchmarks, as well as on a physical robot, show that the method achieves a 1.76× speedup in inference latency while improving task success rates by 1.9 percentage points on LIBERO and by 5.0 percentage points on the physical robot.

📝 Abstract
Vision-Language-Action (VLA) models have demonstrated remarkable generalization capabilities in robotic manipulation tasks, yet their substantial computational overhead remains a critical obstacle to real-world deployment. Improving inference efficiency is therefore essential for practical robotic applications. Existing acceleration methods often rely on heuristic or static strategies, such as rule-based token caching or pruning, that are decoupled from task objectives and fail to adapt to dynamic scene changes. In this work, we reformulate inference acceleration as a learnable policy optimization problem and propose a novel framework that integrates a dynamic, task-aware decision-making process directly into the VLA model. At its core are two lightweight, cooperative modules: a Cached Token Selector, which determines which tokens should be reused, and a Cache Ratio Predictor, which controls how many tokens to reuse. Training these modules is non-trivial due to their discrete decisions. We address this by adopting a differentiable relaxation that allows gradient-based end-to-end optimization. Extensive experiments on the LIBERO and SIMPLER benchmarks, as well as real-robot evaluations, show that our method achieves a 1.76x wall-clock inference speedup while simultaneously improving the average success rate by 1.9 percentage points (from 75.0% to 76.9%) on LIBERO and by 5.0 percentage points on real-world tasks, significantly outperforming existing baselines. This work highlights the potential of learning task-aware computational allocation policies, paving the way for VLA models that are both powerful and efficient.
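The abstract does not specify which differentiable relaxation is used; a common choice for training discrete keep/reuse decisions end-to-end is the Gumbel-softmax trick. The sketch below is a minimal, hypothetical NumPy illustration of how the two modules could interact at one inference step: a Cache Ratio Predictor sets the token budget, and a Cached Token Selector scores tokens for reuse. All names, shapes, and the Gumbel-softmax choice are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable relaxation of a discrete reuse/recompute choice
    (straight-through estimator omitted for brevity)."""
    rng = rng or np.random.default_rng(0)
    g = -np.log(-np.log(rng.uniform(1e-9, 1.0, logits.shape)))  # Gumbel noise
    y = (logits + g) / tau
    e = np.exp(y - y.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cache_step(tokens, prev_cache, selector_logits, ratio_logit, tau=1.0):
    """One inference step: predict a cache ratio, score tokens for reuse,
    then mix cached and freshly computed visual tokens accordingly.

    tokens:          (N, D) freshly computed visual tokens
    prev_cache:      (N, D) tokens cached from the previous step
    selector_logits: (N, 2) per-token [reuse, recompute] logits
    ratio_logit:     scalar logit for the fraction of tokens to reuse
    """
    # Cache Ratio Predictor: fraction of tokens to reuse.
    ratio = 1.0 / (1.0 + np.exp(-ratio_logit))
    # Cached Token Selector: soft per-token reuse probability.
    probs = gumbel_softmax(selector_logits, tau)[..., 0]
    # Hard top-k at inference: reuse the k highest-scoring tokens.
    k = int(round(ratio * len(tokens)))
    reuse = np.zeros(len(tokens))
    if k > 0:
        reuse[np.argsort(-probs)[:k]] = 1.0
    # Reused tokens come from the cache; the rest are recomputed.
    return reuse[:, None] * prev_cache + (1 - reuse[:, None]) * tokens
```

During training, the hard top-k step would be replaced by the soft probabilities themselves so that gradients flow through both modules; the hard selection shown here is the cheap inference-time path.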
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models
inference efficiency
token caching
robotic manipulation
computational overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive token caching
vision-language-action models
learnable acceleration policy
differentiable relaxation
task-aware inference