Running VLAs at Real-time Speed

📅 2025-10-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large Vision-Language-Action (VLA) models face significant challenges in meeting real-time robotic control requirements due to high computational latency. To address this, we propose a fully streaming inference framework optimized for low-latency deployment. Our approach integrates multi-view input co-modeling, computational graph pruning, memory reuse, and streaming token scheduling—collectively enabling lightweight execution. To our knowledge, this is the first work achieving 30 Hz video frame processing and 480 Hz trajectory generation on a single consumer-grade GPU. The framework drastically reduces end-to-end inference latency, enabling closed-loop control for highly dynamic tasks. In physical experiments, the pi0 policy built upon our framework achieves 100% success in high-speed grasping of a falling pen, demonstrating both feasibility and robustness of large VLAs in stringent real-time control settings. This work establishes critical infrastructure for transitioning VLAs from offline decision-making to online, embodied intelligence.

📝 Abstract
In this paper, we show how to run a pi0-level multi-view VLA at a 30 Hz frame rate and up to a 480 Hz trajectory frequency on a single consumer GPU. This enables dynamic, real-time tasks previously believed to be out of reach for large VLA models. To achieve this, we introduce a bag of strategies that eliminate the overheads in model inference. Real-world experiments show that the pi0 policy with our strategies achieves a 100% success rate on a falling-pen grasping task. Based on these results, we further propose a full streaming inference framework for real-time robot control with VLAs. Code is available at https://github.com/Dexmal/realtime-vla.
Problem

Research questions and friction points this paper is trying to address.

Achieving real-time vision-language-action (VLA) model execution
Eliminating inference overhead for dynamic robot tasks
Enabling high-frequency streaming control with VLAs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Achieving 30 Hz VLA inference on a single consumer GPU
Eliminating inference overhead via a bag of optimization strategies
Implementing a streaming framework for real-time robot control
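The 30 Hz and 480 Hz figures fit together as chunked streaming control: each inference step consumes one camera frame and emits a short chunk of future actions, so 16 actions per chunk yields 30 × 16 = 480 actions per second. A minimal sketch of this rate arithmetic (all names are hypothetical, not from the paper's codebase; the real pipeline lives in the linked repo):

```python
from collections import deque

FRAME_HZ = 30                   # camera / inference rate
ACTION_HZ = 480                 # trajectory (actuation) rate
CHUNK = ACTION_HZ // FRAME_HZ   # 16 actions emitted per inference step


def fake_policy(frame_idx):
    """Stand-in for one VLA forward pass: one frame in, CHUNK actions out."""
    return [frame_idx + i / ACTION_HZ for i in range(CHUNK)]


def run_steps(n_frames):
    """Run n_frames inference steps and collect the streamed actions.

    In a real controller the actuator would drain this queue at 480 Hz
    while the next 30 Hz inference step runs concurrently.
    """
    actions = deque()
    for t in range(n_frames):
        actions.extend(fake_policy(t))  # one 30 Hz inference step
    return list(actions)


acts = run_steps(3)
print(len(acts))  # 3 frames x 16 actions per chunk = 48
```

The key design point the sketch illustrates: because each inference step covers 16 control ticks, the policy only needs to finish one forward pass every ~33 ms to keep the 480 Hz action stream unbroken.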
Yunchao Ma
Dexmal
Yizhuang Zhou
StepFun
Yunhuan Yang
Dexmal
Tiancai Wang
Dexmal
Haoqiang Fan
Megvii