StreamingVLA: Streaming Vision-Language-Action Model with Action Flow Matching and Adaptive Early Observation

2026-03-30
Citations: 0
Influential: 0
AI Summary
This work addresses the high latency and frequent pauses of traditional vision-language-action (VLA) models. Both stem from the serial execution of the perception, decision-making, and action stages, which limits responsive, fluent operation on edge devices. To overcome this, the authors propose an asynchronous, parallel streaming architecture that replaces conventional chunk-wise action denoising with action flow matching and adds an adaptive early observation mechanism guided by action saliency. This design overlaps the observation, decision, and execution phases, masking pipeline delays. Without compromising task performance, the approach achieves a 2.4× latency speedup and reduces execution stalls by 6.5×, improving system responsiveness and interaction smoothness.
Abstract
Vision-language-action (VLA) models have demonstrated exceptional performance in natural language-driven perception and control. However, the high computational cost of VLA models poses significant efficiency challenges, particularly for resource-constrained edge platforms in real-world deployments. Moreover, because the different stages of a VLA (observation, action generation, and execution) must proceed sequentially, each waiting for the preceding stage to complete, the system suffers from frequent halting and high latency. To address this, we conduct a systematic analysis to identify the challenges for fast and fluent generation, and propose enabling VLAs to asynchronously parallelize across VLA stages in a "streaming" manner. First, we eliminate the reliance on action chunking and adopt action flow matching, which learns the trajectory of action flows rather than denoising chunk-wise actions. This overlaps the latency of action generation and execution. Second, we design an action saliency-aware adaptive observation mechanism, thereby overlapping the latency of execution and observation. Without sacrificing performance, StreamingVLA achieves substantial speedup and improves the fluency of execution. It achieves a $2.4\times$ latency speedup and reduces execution halting by $6.5\times$.
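To make the latency argument concrete, the following is a minimal timing sketch, not the authors' implementation. It compares a fully serial observe, generate, execute loop with an idealized three-stage pipeline in which the stages overlap completely. The function names and the stage durations (20, 80, and 50 ms) are hypothetical values chosen for illustration. The model also ignores the data dependency between executing one step and observing for the next, which is the dependency the adaptive early observation mechanism is meant to relax.

```python
def sequential_time(n_steps, t_obs, t_gen, t_exec):
    """Total time when every step runs observe -> generate -> execute in series."""
    return n_steps * (t_obs + t_gen + t_exec)


def pipelined_time(n_steps, t_obs, t_gen, t_exec):
    """Total time for an idealized pipeline with full overlap between stages.

    The first step pays for every stage once; each later step only adds
    the duration of the slowest (bottleneck) stage.
    """
    bottleneck = max(t_obs, t_gen, t_exec)
    return (t_obs + t_gen + t_exec) + (n_steps - 1) * bottleneck


# Hypothetical per-stage durations in milliseconds (assumed, not from the paper).
T_OBS, T_GEN, T_EXEC = 20, 80, 50
N_STEPS = 100

serial_ms = sequential_time(N_STEPS, T_OBS, T_GEN, T_EXEC)
streamed_ms = pipelined_time(N_STEPS, T_OBS, T_GEN, T_EXEC)
print(f"serial: {serial_ms} ms, pipelined: {streamed_ms} ms")
```

Under these assumed durations, the serial loop is dominated by the sum of all stage times, while the pipelined loop is dominated by the slowest stage alone. The speedup of a real system depends on how much of each stage can actually be overlapped.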
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
latency
real-time execution
edge computing
action generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

StreamingVLA
Action Flow Matching
Adaptive Observation
Asynchronous Parallelization
Vision-Language-Action
Yiran Shi
Tsinghua University
Dongqi Guo
Tsinghua University
Tianchen Zhao
Tsinghua University
EfficientML, Model Compression, Visual Generation
Feng Gao
Tsinghua University
Reinforcement Learning, Robot Learning
Liangzhi Shi
Tsinghua University
Chao Yu
Tsinghua University
ZhiJian Mo
Lenovo Group Ltd.
Qihua Xiao
Lenovo Group Ltd.
XiaoShuai Peng
Lenovo Group Ltd.
Qingmin Liao
Tsinghua University
Yu Wang
Tsinghua University