🤖 AI Summary
Vision-Language-Action (VLA) models hold significant promise for general-purpose robotic control, yet their high computational cost and autoregressive decoding hinder real-time deployment. This paper introduces FlashVLA, a training-free, plug-and-play acceleration framework that targets two key sources of redundancy: highly similar consecutive action steps and redundant visual tokens. The contributions are threefold: (1) a token-aware action reuse mechanism that caches actions and skips redundant decoding across stable steps; (2) an information-guided dynamic visual token pruning strategy that eliminates redundancy while preserving task-critical features; and (3) an adaptive stable-step detection module that keeps inference robust under varying dynamics. Evaluated on the LIBERO benchmark, FlashVLA reduces FLOPs by 55.7% and end-to-end latency by 36.0%, with only a 0.7% drop in task success rate, making it, per the authors, the first training-free framework to enable action reuse in VLA inference without any fine-tuning.
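To make the action reuse idea concrete, below is a minimal sketch in PyTorch. It is an illustration under stated assumptions, not the paper's actual method: the function names (`maybe_reuse_action`, `decode_action`), the mean-pooling of token features, and the cosine-similarity threshold are all hypothetical stand-ins for FlashVLA's token-importance-based stability test.

```python
import torch
import torch.nn.functional as F

def decode_action(feats):
    # Stand-in for the VLA model's autoregressive action decoder
    # (placeholder so the sketch runs; not part of the paper).
    return feats.mean(dim=0)

def maybe_reuse_action(curr_feats, cache, sim_threshold=0.95):
    """Reuse the cached action when consecutive steps look stable.

    curr_feats: (N, D) visual token features for the current step.
    cache: dict holding the previous step's features and decoded action.
    """
    if cache.get("feats") is not None:
        # Mean-pool token features and compare against the previous step
        # (an assumed stability proxy; the paper scores token importance).
        sim = F.cosine_similarity(curr_feats.mean(dim=0),
                                  cache["feats"].mean(dim=0), dim=0)
        if sim > sim_threshold:
            return cache["action"], True   # stable step: skip decoding
    action = decode_action(curr_feats)     # expensive autoregressive path
    cache["feats"], cache["action"] = curr_feats, action
    return action, False

# Usage: over a rollout, stable steps return (cached_action, True)
# and avoid the decoder entirely, which is where latency is saved.
cache = {}
for step in range(3):
    feats = torch.randn(256, 1024)         # (N, D) visual token features
    action, reused = maybe_reuse_action(feats, cache)
```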
📝 Abstract
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for general-purpose robot control through natural language instructions. However, their high inference cost, stemming from large-scale token computation and autoregressive decoding, poses significant challenges for real-time deployment and edge applications. While prior work has primarily focused on architectural optimization, we take a different perspective by identifying a dual form of redundancy in VLA models: (i) high similarity across consecutive action steps, and (ii) substantial redundancy in visual tokens. Motivated by these observations, we propose FlashVLA, the first training-free and plug-and-play acceleration framework that enables action reuse in VLA models. FlashVLA improves inference efficiency through a token-aware action reuse mechanism that avoids redundant decoding across stable action steps, and an information-guided visual token selection strategy that prunes low-contribution tokens. Extensive experiments on the LIBERO benchmark show that FlashVLA reduces FLOPs by 55.7% and latency by 36.0%, with only a 0.7% drop in task success rate. These results demonstrate the effectiveness of FlashVLA in enabling lightweight, low-latency VLA inference without retraining.
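The visual token selection step can likewise be sketched as importance-based top-k pruning. The sketch below assumes a per-token importance score is already available; the norm-based proxy and the `keep_ratio` parameter are assumptions for illustration, and the paper's information-guided criterion is not reproduced here.

```python
import torch

def prune_visual_tokens(tokens, importance, keep_ratio=0.5):
    """Keep only the highest-scoring visual tokens, preserving order.

    tokens: (N, D) visual token embeddings.
    importance: (N,) per-token contribution scores.
    """
    k = max(1, int(keep_ratio * tokens.shape[0]))
    keep = torch.topk(importance, k).indices.sort().values  # keep spatial order
    return tokens[keep]

tokens = torch.randn(256, 1024)       # (N, D) visual token embeddings
importance = tokens.norm(dim=-1)      # assumed proxy score, not the
                                      # paper's information-guided measure
pruned = prune_visual_tokens(tokens, importance)   # -> (128, 1024)
```

Because the decoder then attends over a shorter visual sequence, FLOPs drop roughly in proportion to the fraction of tokens pruned, which is consistent with the reported 55.7% FLOPs reduction when combined with action reuse.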