🤖 AI Summary
To address high inference latency and excessive computational redundancy in Vision-Language-Action (VLA) models for real-time robotic manipulation, this paper proposes an adaptive visual token selection and KV caching mechanism tailored for VLA models. Leveraging the high visual similarity across consecutive action frames, our method dynamically identifies invariant tokens and reuses their key-value (KV) states to enable cross-step computation sharing. It integrates incremental visual difference detection, dynamic KV-cache updating, and reloading within a unified Transformer-based joint vision-language-action modeling framework—preserving action prediction accuracy. Evaluated on LIBERO, SIMPLER simulation, and real-robot tasks, our approach achieves 1.8–2.3× inference speedup with only a negligible (<1.2%) drop in task success rate. To the best of our knowledge, this is the first work to realize efficient, low-degradation sequential decision acceleration for VLA models.
📝 Abstract
Vision-Language-Action (VLA) models can process instructions and visual perception to directly generate actions as output in an end-to-end fashion owing to their strong multi-modal reasoning capabilities. While the performance of VLA models is promising, their computational cost can be substantial. This raises challenges for applying them to robotic tasks, which require real-time decision-making to respond quickly to environmental changes. Since robotic control involves sequential decision-making, the visual input often exhibits minimal variation between successive steps. A natural idea is therefore to reuse the computational results of unchanged visual tokens from the previous step. Motivated by this idea, we propose VLA-Cache, an efficient vision-language-action model. VLA-Cache incorporates a token-selection mechanism that compares the visual input at each step with the input from the previous step, adaptively identifying visual tokens with minimal changes. The computational results for these unchanged tokens are then reused in subsequent steps via the KV-cache, significantly improving inference efficiency. Experimental results on both simulation (the LIBERO benchmark and SIMPLER) and a real-world robot validate that VLA-Cache achieves practical acceleration with minimal sacrifice in success rate.
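The core idea — flag visual tokens that barely changed since the previous step, then reuse their cached key-value states instead of recomputing them — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the cosine-similarity criterion, and the `threshold` value are all assumptions for the sake of the example.

```python
import numpy as np

def select_static_tokens(prev_tokens, curr_tokens, threshold=0.99):
    """Return indices of visual tokens whose embeddings barely changed
    between consecutive steps (per-token cosine similarity >= threshold).
    Both arrays have shape (num_tokens, dim). The cosine criterion and
    threshold are illustrative choices, not the paper's exact rule."""
    prev_n = prev_tokens / np.linalg.norm(prev_tokens, axis=1, keepdims=True)
    curr_n = curr_tokens / np.linalg.norm(curr_tokens, axis=1, keepdims=True)
    sim = np.sum(prev_n * curr_n, axis=1)  # cosine similarity per token
    return np.where(sim >= threshold)[0]

def update_kv_cache(cache, static_idx, curr_tokens, compute_kv):
    """Recompute K/V only for changed tokens; reuse cached K/V for the rest.
    `cache` is the previous step's (K, V) pair, each (num_tokens, dim);
    `compute_kv` is a stand-in for one attention layer's K/V projection."""
    K_prev, V_prev = cache
    n = curr_tokens.shape[0]
    changed = np.setdiff1d(np.arange(n), static_idx)
    K_new, V_new = K_prev.copy(), V_prev.copy()
    if changed.size:  # only the changed tokens pay the projection cost
        K_new[changed], V_new[changed] = compute_kv(curr_tokens[changed])
    return K_new, V_new
```

In a real model this selection would run once per control step, and the fraction of static tokens directly determines the saved attention computation.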