🤖 AI Summary
Large Vision-Language-Action (VLA) models face significant challenges in meeting real-time robotic control requirements due to high computational latency. To address this, we propose a fully streaming inference framework optimized for low-latency deployment. Our approach integrates multi-view input co-modeling, computational graph pruning, memory reuse, and streaming token scheduling—collectively enabling lightweight execution. To our knowledge, this is the first work achieving 30 Hz video frame processing and 480 Hz trajectory generation on a single consumer-grade GPU. The framework drastically reduces end-to-end inference latency, enabling closed-loop control for highly dynamic tasks. In physical experiments, the pi0 policy built upon our framework achieves 100% success in high-speed grasping of a falling pen, demonstrating both feasibility and robustness of large VLAs in stringent real-time control settings. This work establishes critical infrastructure for transitioning VLAs from offline decision-making to online, embodied intelligence.
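The stated rates imply a concrete latency budget. A back-of-envelope sketch (derived only from the 30 Hz / 480 Hz figures quoted above; the constant names are illustrative, not from the released code):

```python
# Latency budget implied by the quoted rates: 30 Hz frame processing
# and 480 Hz trajectory generation. Names are illustrative only.
FRAME_HZ = 30
TRAJ_HZ = 480

frame_budget_ms = 1000.0 / FRAME_HZ       # wall-clock time available per video frame
actions_per_frame = TRAJ_HZ // FRAME_HZ   # trajectory points emitted per frame

print(f"per-frame budget: {frame_budget_ms:.1f} ms")  # ~33.3 ms for full VLA inference
print(f"actions per frame: {actions_per_frame}")      # 16 trajectory points
```

In other words, the entire multi-view VLA forward pass must fit in roughly 33 ms, and each pass must yield about 16 trajectory points to sustain 480 Hz control.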
📝 Abstract
In this paper, we show how to run a pi0-level multi-view VLA at a 30 Hz frame rate with a trajectory frequency of up to 480 Hz on a single consumer GPU. This enables dynamic, real-time tasks that were previously believed to be unattainable by large VLA models. To achieve this, we introduce a bag of strategies that eliminate overheads in model inference. Real-world experiments show that the pi0 policy with our strategies achieves a 100% success rate on a falling-pen grasping task. Based on these results, we further propose a fully streaming inference framework for real-time robot control with VLAs. Code is available at https://github.com/Dexmal/realtime-vla.
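The decoupled rates in the abstract (30 Hz perception, 480 Hz control) suggest a chunked streaming pattern: each inference pass emits a short action chunk, which the controller streams out at the higher rate. The sketch below illustrates only that scheduling idea; `stub_policy` is a placeholder and none of these names come from the repository's API:

```python
# Hypothetical sketch of decoupled-rate streaming: the policy runs once
# per frame (30 Hz) and emits a chunk of actions that the controller
# plays out at 480 Hz. `stub_policy` stands in for real VLA inference.
FRAME_HZ, TRAJ_HZ = 30, 480
CHUNK = TRAJ_HZ // FRAME_HZ  # 16 actions per incoming frame

def stub_policy(frame_id: int) -> list[float]:
    """Placeholder for the VLA: returns one chunk of CHUNK actions."""
    return [frame_id + i / CHUNK for i in range(CHUNK)]

def stream_actions(frame_ids) -> list[float]:
    """Flatten per-frame chunks into a single high-rate action stream."""
    actions: list[float] = []
    for f in frame_ids:
        actions.extend(stub_policy(f))
    return actions

actions = stream_actions(range(3))
assert len(actions) == 3 * CHUNK  # 48 actions streamed for 3 frames
```

A real implementation would additionally overlap inference with action playback, so the next chunk is ready before the current one is exhausted; this sketch shows only the rate conversion.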