🤖 AI Summary
Vision-Language-Action (VLA) models face stringent real-time inference demands in real-world robotic deployment, yet their inference performance depends on both model architecture and system configuration in ways that have not been systematically understood. This work proposes VLA-Perf, the first performance model capable of analytically predicting end-to-end latency for arbitrary combinations of VLA models and inference systems. By combining analytical modeling, multidimensional experiments, and joint hardware-network simulation, the study provides the first systematic characterization of the VLA inference performance landscape. The analysis distills 15 key design principles spanning model scaling, architectural choices, long-context video processing, asynchronous inference, and edge-cloud collaboration, offering actionable guidance for designing efficient VLA systems.
📝 Abstract
Vision-Language-Action (VLA) models have recently demonstrated impressive capabilities across various embodied AI tasks. While deploying VLA models on real-world robots imposes strict real-time inference constraints, the inference performance landscape of VLA models remains poorly understood due to the large combinatorial space of model architectures and inference systems. In this paper, we ask a fundamental research question: How should we design future VLA models and systems to support real-time inference? To address this question, we first introduce VLA-Perf, an analytical performance model that predicts end-to-end inference latency for arbitrary combinations of VLA models and inference systems. Using VLA-Perf, we conduct the first systematic study of the VLA inference performance landscape. From a model-design perspective, we examine how inference performance is affected by model scaling, model architectural choices, long-context video inputs, asynchronous inference, and dual-system model pipelines. From a deployment perspective, we analyze where VLA inference should be executed -- on-device, on edge servers, or in the cloud -- and how hardware capability and network performance jointly determine end-to-end latency. By distilling 15 key takeaways from our comprehensive evaluation, we hope this work provides practical guidance for the design of future VLA models and inference systems.
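To make the idea of an analytical latency model concrete, the sketch below shows one common way such models are built: a roofline-style per-layer estimate (each layer is either compute- or memory-bound) summed over the model, plus an optional network term for edge or cloud offload. This is purely illustrative; the function names, hardware numbers, and workload values are assumptions for the example, not the actual formulation or parameters of VLA-Perf.

```python
# Illustrative roofline-style analytical latency model (not VLA-Perf itself).
# All parameter names and numeric values below are hypothetical.

def layer_latency(flops, bytes_moved, peak_flops, mem_bw):
    """Roofline estimate: a layer's time is bounded by compute or memory traffic."""
    return max(flops / peak_flops, bytes_moved / mem_bw)

def end_to_end_latency(layers, peak_flops, mem_bw,
                       net_bytes=0.0, net_bw=float("inf"), net_rtt=0.0):
    """Sum per-layer estimates; add network transfer + RTT for remote execution."""
    compute = sum(layer_latency(f, b, peak_flops, mem_bw) for f, b in layers)
    network = net_rtt + net_bytes / net_bw
    return compute + network

# Toy 2-layer workload: (FLOPs, bytes moved) per layer.
layers = [(2e12, 4e9), (1e12, 8e9)]

# On-device: hypothetical 100 TFLOP/s accelerator with 1 TB/s memory bandwidth.
local = end_to_end_latency(layers, peak_flops=100e12, mem_bw=1e12)

# Cloud: faster hardware, but pay 5 MB upload over ~100 Mbps plus 20 ms RTT.
remote = end_to_end_latency(layers, peak_flops=400e12, mem_bw=3e12,
                            net_bytes=5e6, net_bw=100e6 / 8, net_rtt=0.02)
```

Even this toy version captures the trade-off the paper studies: faster remote hardware shrinks the compute term, but the network term can dominate, so the best placement (device, edge, or cloud) depends jointly on hardware capability and network conditions.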