🤖 AI Summary
Large Vision-Language-Action (VLA) models face significant challenges in meeting real-time robotic control requirements due to high computational latency. To address this, we propose a fully streaming inference framework optimized for low-latency deployment. Our approach integrates multi-view input co-modeling, computational graph pruning, memory reuse, and streaming token scheduling—collectively enabling lightweight execution. To our knowledge, this is the first work achieving 30 Hz video frame processing and 480 Hz trajectory generation on a single consumer-grade GPU. The framework drastically reduces end-to-end inference latency, enabling closed-loop control for highly dynamic tasks. In physical experiments, the pi0 policy built upon our framework achieves 100% success in high-speed grasping of a falling pen, demonstrating both feasibility and robustness of large VLAs in stringent real-time control settings. This work establishes critical infrastructure for transitioning VLAs from offline decision-making to online, embodied intelligence.
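The stated rates imply a concrete latency budget. A back-of-envelope sketch (derived only from the 30 Hz / 480 Hz figures quoted above; the constant names are illustrative, not from the released code):

```python
# Latency budget implied by the quoted rates: 30 Hz frame processing
# and 480 Hz trajectory generation. Names are illustrative only.
FRAME_HZ = 30
TRAJ_HZ = 480

frame_budget_ms = 1000.0 / FRAME_HZ       # wall-clock time available per video frame
actions_per_frame = TRAJ_HZ // FRAME_HZ   # trajectory points emitted per frame

print(f"per-frame budget: {frame_budget_ms:.1f} ms")  # ~33.3 ms for full VLA inference
print(f"actions per frame: {actions_per_frame}")      # 16 trajectory points
```

In other words, the entire multi-view VLA forward pass must fit in roughly 33 ms, and each pass must yield about 16 trajectory points to sustain 480 Hz control.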
📝 Abstract
In this paper, we show how to run a pi0-level multi-view VLA at a 30 Hz frame rate with a trajectory frequency of up to 480 Hz on a single consumer GPU. This enables dynamic, real-time tasks that were previously believed to be unattainable by large VLA models. To achieve this, we introduce a bag of strategies that eliminate overheads in model inference. Real-world experiments show that the pi0 policy with our strategies achieves a 100% success rate on a falling-pen grasping task. Based on these results, we further propose a fully streaming inference framework for real-time robot control with VLAs. Code is available at https://github.com/Dexmal/realtime-vla.
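The decoupled rates in the abstract (30 Hz perception, 480 Hz control) suggest a chunked streaming pattern: each inference pass emits a short action chunk, which the controller streams out at the higher rate. The sketch below illustrates only that scheduling idea; `stub_policy` is a placeholder and none of these names come from the repository's API:

```python
# Hypothetical sketch of decoupled-rate streaming: the policy runs once
# per frame (30 Hz) and emits a chunk of actions that the controller
# plays out at 480 Hz. `stub_policy` stands in for real VLA inference.
FRAME_HZ, TRAJ_HZ = 30, 480
CHUNK = TRAJ_HZ // FRAME_HZ  # 16 actions per incoming frame

def stub_policy(frame_id: int) -> list[float]:
    """Placeholder for the VLA: returns one chunk of CHUNK actions."""
    return [frame_id + i / CHUNK for i in range(CHUNK)]

def stream_actions(frame_ids) -> list[float]:
    """Flatten per-frame chunks into a single high-rate action stream."""
    actions: list[float] = []
    for f in frame_ids:
        actions.extend(stub_policy(f))
    return actions

actions = stream_actions(range(3))
assert len(actions) == 3 * CHUNK  # 48 actions streamed for 3 frames
```

A real implementation would additionally overlap inference with action playback, so the next chunk is ready before the current one is exhausted; this sketch shows only the rate conversion.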