ActionFlow: A Pipelined Action Acceleration for Vision Language Models on Edge

📅 2025-12-23
🤖 AI Summary
VLA models run at only 3–5 Hz on edge devices, far below the 20–30 Hz required for real-time robotic control, because autoregressive decoding is memory-bound. To address this, ActionFlow provides a system-level inference framework tailored for real-time VLA inference on edge platforms. It introduces cross-request pipelined scheduling, which reformulates VLA inference as a macro-pipeline of micro-requests and batches memory-bound decode phases with compute-bound prefill phases across time steps. To support this scheduling, it adds a Cross-Request State Packed Forward operator and a Unified KV Ring Buffer, which fuse fragmented memory operations into dense computations. Evaluated on OpenVLA-7B, the framework achieves a 2.55× FPS improvement without retraining, enabling real-time dynamic manipulation on edge hardware.

📝 Abstract
Vision-Language-Action (VLA) models have emerged as a unified paradigm for robotic perception and control, enabling emergent generalization and long-horizon task execution. However, their deployment in dynamic, real-world environments is severely hindered by high inference latency. While smooth robotic interaction requires control frequencies of 20 to 30 Hz, current VLA models typically operate at only 3-5 Hz on edge devices due to the memory-bound nature of autoregressive decoding. Existing optimizations often require extensive retraining or compromise model accuracy. To bridge this gap, we introduce ActionFlow, a system-level inference framework tailored for resource-constrained edge platforms. At the core of ActionFlow is a Cross-Request Pipelining strategy, a novel scheduler that redefines VLA inference as a macro-pipeline of micro-requests. The strategy intelligently batches memory-bound Decode phases with compute-bound Prefill phases across continuous time steps to maximize hardware utilization. Furthermore, to support this scheduling, we propose a Cross-Request State Packed Forward operator and a Unified KV Ring Buffer, which fuse fragmented memory operations into efficient dense computations. Experimental results demonstrate that ActionFlow achieves a 2.55x improvement in FPS on the OpenVLA-7B model without retraining, enabling real-time dynamic manipulation on edge hardware. Our work is available at https://anonymous.4open.science/r/ActionFlow-1D47.
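The macro-pipeline idea in the abstract can be illustrated with a toy schedule. The sketch below is not the authors' scheduler: the function names, the one-request-per-tick admission rule, and the fixed number of decode steps are all assumptions made for illustration. It only counts forward passes and does not model the actual cost of a batched pass.

```python
from typing import List, Tuple

Step = Tuple[int, str]  # (request id, phase label); hypothetical representation


def sequential_schedule(num_requests: int, decode_steps: int) -> List[List[Step]]:
    """Baseline: every forward pass runs one phase of one request."""
    ticks: List[List[Step]] = []
    for r in range(num_requests):
        ticks.append([(r, "prefill")])
        for k in range(1, decode_steps + 1):
            ticks.append([(r, f"decode{k}")])
    return ticks


def pipelined_schedule(num_requests: int, decode_steps: int) -> List[List[Step]]:
    """Toy cross-request pipeline: request r is admitted at tick r.

    At tick t the batch holds the prefill of request t together with
    decode step k of request t - k, so phases of different requests
    share one forward pass.
    """
    ticks: List[List[Step]] = []
    for t in range(num_requests + decode_steps):
        batch: List[Step] = []
        if t < num_requests:
            batch.append((t, "prefill"))
        for k in range(1, decode_steps + 1):
            r = t - k
            if 0 <= r < num_requests:
                batch.append((r, f"decode{k}"))
        ticks.append(batch)
    return ticks


if __name__ == "__main__":
    for t, batch in enumerate(pipelined_schedule(4, 2)):
        print(t, batch)
```

In this toy model the sequential schedule needs `num_requests * (1 + decode_steps)` passes, while the pipelined one needs `num_requests + decode_steps`. Whether that translates into a real speedup depends on how cheaply the hardware can run the mixed prefill/decode batch, which this sketch does not capture.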
Problem

Research questions and friction points this paper is trying to address.

Reduces high inference latency in Vision-Language-Action models on edge devices
Enables real-time robotic control without retraining or accuracy loss
Optimizes hardware utilization via pipelining memory-bound and compute-bound phases
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Request Pipelining strategy for hardware utilization
Cross-Request State Packed Forward operator for dense computations
Unified KV Ring Buffer to fuse memory operations
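As a rough illustration of the ring-buffer idea named in the last item, the sketch below shows a generic fixed-capacity circular buffer shared by several requests. This page does not describe the paper's Unified KV Ring Buffer in detail, so the class name, the slot layout, and the overwrite-oldest policy are assumptions, and the key/value entries are plain placeholders rather than tensors.

```python
from typing import Any, List, Optional, Tuple

Entry = Tuple[int, Any, Any]  # (request id, key, value); hypothetical layout


class KVRingBuffer:
    """Generic circular buffer; NOT the paper's implementation."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.slots: List[Optional[Entry]] = [None] * capacity
        self.head = 0  # index of the next slot to write
        self.size = 0  # number of live entries

    def append(self, request_id: int, key: Any, value: Any) -> int:
        """Write one entry, overwriting the oldest when full; return its slot."""
        slot = self.head
        self.slots[slot] = (request_id, key, value)
        self.head = (self.head + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)
        return slot

    def window(self) -> List[Entry]:
        """Return live entries ordered from oldest to newest."""
        start = (self.head - self.size) % self.capacity
        return [self.slots[(start + i) % self.capacity] for i in range(self.size)]


if __name__ == "__main__":
    buf = KVRingBuffer(capacity=3)
    for step in range(4):
        buf.append(request_id=step, key=f"k{step}", value=f"v{step}")
    print(buf.window())
```

Because storage is allocated once and indexed modulo the capacity, entries from consecutive requests sit in one contiguous region and old entries are reclaimed without any per-request allocation or copying.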
Yuntao Dai
School of Computer Science and Technology, University of Science and Technology of China, Hefei, China
Hang Gu
School of Computer Science and Technology, University of Science and Technology of China, Hefei, China
Teng Wang
Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China
Qianyu Cheng
University of Science and Technology of China
Analytical Processing, Near-Storage Computing, Domain-Specific Architecture, Non-Relational Database
Yifei Zheng
School of Computer Science and Technology, University of Science and Technology of China, Hefei, China
Zhiyong Qiu
IEIT SYSTEMS Co., Ltd., Beijing, China
Lei Gong
School of Computer Science and Technology, University of Science and Technology of China, Hefei, China
Wenqi Lou
University of Science and Technology of China
FPGA Accelerator, Algorithm-hardware Co-Optimization
Xuehai Zhou
School of Computer Science and Technology, University of Science and Technology of China, Hefei, China; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China