🤖 AI Summary
This work addresses the limitations of vision-language-action (VLA) models in long-horizon tasks, particularly their constrained context length and inefficient inference. To overcome these challenges, the authors propose a static-dynamic decoupled VLA framework (SD-VLA), which decomposes visual inputs into static and dynamic multi-level tokens. By retaining only a single copy of static tokens and reusing their key-value (KV) cache across frames, combined with a lightweight recaching gating mechanism, the method enables efficient multi-frame fusion. The study also introduces the first benchmark specifically designed for evaluating long-horizon dependency modeling in VLA systems. Experimental results demonstrate that SD-VLA achieves a 39.8% absolute improvement in success rate on the new benchmark and a 3.9% gain on SimplerEnv, while accelerating inference by 2.26× compared to baseline approaches.
📝 Abstract
Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for generalist robotic control. Built upon vision-language model (VLM) architectures, VLAs predict actions conditioned on visual observations and language instructions, achieving strong performance and generalization across tasks. However, VLAs face two major challenges: limited long-horizon context and inefficient inference due to quadratic attention complexity and large parameter counts. Our work is motivated by the observation that much of the visual information in a trajectory remains static across timesteps (e.g., the background). Leveraging this property, we propose SD-VLA, a framework that disentangles visual inputs into multi-level static and dynamic tokens, which enables (1) retaining a single copy of static tokens across frames to significantly reduce context length, and (2) reusing the key-value (KV) cache of static tokens through a lightweight recache gate that updates only when necessary. This design enables efficient multi-frame integration and fast inference. In addition, we introduce a new benchmark that more effectively evaluates the long-horizon temporal dependency modeling ability of VLAs. Experimental results show that our approach outperforms baselines on this benchmark by a 39.8% absolute improvement in success rate, and achieves a 3.9% gain on the SimplerEnv benchmark. Moreover, SD-VLA delivers a 2.26× inference speedup over the base VLA model on the same benchmark, enabling faster and more practical real-world deployment.
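To make the core idea concrete, here is a minimal sketch of static-token KV-cache reuse with a recache gate. This is an illustration only, not the paper's implementation: the gate here is a simple L2-drift threshold (`tau`), standing in for the learned lightweight recache gate, and the single-head projections, dimensions, and thresholds are all assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding dimension (illustrative)

def project_kv(tokens, Wk, Wv):
    """Project tokens to key/value matrices (one attention head, illustrative)."""
    return tokens @ Wk, tokens @ Wv

class StaticKVCache:
    """Cache K/V for static tokens; recompute only when the gate fires.

    The gate is modeled as an L2-drift threshold -- a hypothetical
    stand-in for the paper's learned recache gate.
    """
    def __init__(self, Wk, Wv, tau=0.5):
        self.Wk, self.Wv, self.tau = Wk, Wv, tau
        self.ref = None       # static tokens at the last recache
        self.kv = None        # cached (K, V) for static tokens
        self.recaches = 0

    def get(self, static_tokens):
        drift = (np.inf if self.ref is None
                 else np.linalg.norm(static_tokens - self.ref))
        if drift > self.tau:  # gate fires -> recompute and recache
            self.ref = static_tokens.copy()
            self.kv = project_kv(static_tokens, self.Wk, self.Wv)
            self.recaches += 1
        return self.kv        # otherwise reuse the cached K/V

Wk, Wv = rng.normal(size=(D, D)), rng.normal(size=(D, D))
cache = StaticKVCache(Wk, Wv)

background = rng.normal(size=(16, D))  # static tokens (e.g. the background)
for t in range(5):
    # Static content barely changes across frames; dynamic tokens change fully.
    static = background + 0.001 * rng.normal(size=background.shape)
    dynamic = rng.normal(size=(4, D))
    k_s, v_s = cache.get(static)             # reused across all 5 frames
    k_d, v_d = project_kv(dynamic, Wk, Wv)   # recomputed every frame
    K = np.concatenate([k_s, k_d])           # context holds one static copy
    V = np.concatenate([v_s, v_d])

print(cache.recaches)  # → 1: static K/V was computed once for 5 frames
```

Because only the 4 dynamic tokens are projected per frame while the 16 static tokens keep their cached K/V, per-frame attention context and projection cost shrink accordingly; a large scene change (drift above `tau`) would trigger a single recache.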