🤖 AI Summary
Vision-language-action models (VLAs) suffer from high inference latency and discontinuous responses in real-world deployment, causing action stalls and delayed reactions to environmental changes. Under asynchronous inference, the robot and environment continue to evolve while the model predicts, producing a temporal misalignment between prediction and execution that destabilizes actuation. To address this, we propose a future-state-aware mechanism that aligns prediction timing with execution timing: the robot state is rolled forward with the previously generated action chunk to estimate the environment state at execution time, keeping inference and actuation tightly synchronized without modifying the model architecture or adding runtime overhead. Experiments demonstrate up to 2.03x inference acceleration, up to a 17.4x reduction in reaction latency, and full preservation of the original task accuracy. The approach has been successfully deployed in tasks demanding high temporal precision, including table tennis and whack-a-mole.
📝 Abstract
Vision-Language-Action models (VLAs) are becoming increasingly capable across diverse robotic tasks. However, their real-world deployment remains slow and inefficient: demonstration videos are often sped up by 5-10x to appear smooth, and deployed policies exhibit noticeable action stalls and delayed reactions to environmental changes. Asynchronous inference offers a promising path to continuous, low-latency control by letting the robot execute actions and perform inference simultaneously. However, because the robot and environment continue to evolve during inference, a temporal misalignment arises between the prediction and execution intervals. This causes significant action instability, and existing methods mitigate it only by degrading accuracy or introducing runtime overhead. We propose VLASH, a general asynchronous inference framework for VLAs that delivers smooth, accurate, and fast-reacting control without additional overhead or architectural changes. VLASH estimates the future execution-time state by rolling the robot state forward with the previously generated action chunk, thereby bridging the gap between prediction and execution. Experiments show that VLASH achieves up to 2.03x speedup and reduces reaction latency by up to 17.4x compared to synchronous inference while fully preserving the original accuracy. Moreover, it empowers VLAs to handle fast-reaction, high-precision tasks such as playing ping-pong and whack-a-mole, where traditional synchronous inference fails. Code is available at https://github.com/mit-han-lab/vlash.
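The core idea of the abstract can be sketched in a few lines: while the previous action chunk is still executing, inference is conditioned on the *projected* state at the time the new chunk will arrive, obtained by rolling the current state forward with the actions still in flight. The sketch below is a minimal illustration, not the VLASH implementation; `roll_forward`, `dummy_policy`, the additive joint-delta dynamics, and the chunk length are all assumptions made for the example.

```python
# Minimal sketch of state-rolling asynchronous inference.
# Hypothetical names and dynamics; the real implementation is in the
# VLASH repository (github.com/mit-han-lab/vlash).

from typing import List

def roll_forward(state: List[float], queued_actions: List[List[float]]) -> List[float]:
    """Project the robot state to execution time by applying the actions
    that will run while inference is in flight (here modeled as simple
    additive joint deltas, an assumption for illustration)."""
    projected = list(state)
    for action in queued_actions:
        projected = [s + a for s, a in zip(projected, action)]
    return projected

def dummy_policy(state: List[float], chunk_len: int = 3) -> List[List[float]]:
    """Stand-in for a VLA: returns a chunk of small corrective actions."""
    return [[-0.1 * s for s in state] for _ in range(chunk_len)]

def async_step(state: List[float],
               prev_chunk: List[List[float]],
               inference_steps: int) -> List[List[float]]:
    """One asynchronous inference step: predict the next chunk from the
    estimated execution-time state rather than the stale current state."""
    in_flight = prev_chunk[:inference_steps]   # actions executed during inference
    exec_state = roll_forward(state, in_flight)
    return dummy_policy(exec_state)

state = [1.0, -0.5]                 # toy 2-DoF joint state
chunk = dummy_policy(state)         # previously generated action chunk
next_chunk = async_step(state, chunk, inference_steps=2)
```

In synchronous inference, `dummy_policy(state)` would be called on the state observed *before* inference began; the only change here is the `roll_forward` projection, which is why the approach adds no runtime overhead and requires no architectural change.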