🤖 AI Summary
Traditional vision-language-action (VLA) models rely on synchronous flow matching (SFM) with fixed-time-step scheduling and lack explicit action-context modeling and self-correction capabilities, which leads to error accumulation and poor robustness in long-horizon tasks. To address this, we propose AsyncVLA, the first VLA framework built upon asynchronous flow matching (AFM), enabling non-uniform temporal scheduling and context-aware action generation. We further introduce a confidence-driven selective self-correction mechanism that dynamically identifies erroneous predictions and triggers targeted re-generation. Additionally, we design a unified training paradigm compatible with both synchronous and asynchronous inference, improving KV cache utilization and data efficiency. Evaluated across multiple robotic manipulation benchmarks, AsyncVLA achieves significant gains in long-horizon task success rates, establishing new state-of-the-art performance.
📝 Abstract
Vision-language-action (VLA) models have recently emerged as a powerful paradigm for building generalist robots. However, traditional VLA models that generate actions through flow matching (FM) typically rely on rigid and uniform time schedules, i.e., synchronous FM (SFM). Without action-context awareness and asynchronous self-correction, SFM becomes unstable in long-horizon tasks, where a single action error can cascade into failure. In this work, we propose asynchronous flow matching VLA (AsyncVLA), a novel framework that introduces temporal flexibility through asynchronous FM (AFM) and enables self-correction in action generation. AsyncVLA departs from vanilla SFM in VLA models by generating action tokens on a non-uniform time schedule with action-context awareness. In addition, our method introduces a confidence rater that estimates the confidence of the initially generated actions, enabling the model to selectively refine inaccurate action tokens before execution. Moreover, we propose a unified training procedure for SFM and AFM that endows a single model with both modes, improving KV-cache utilization. Extensive experiments on robotic manipulation benchmarks demonstrate that AsyncVLA is data-efficient and exhibits self-correction ability. AsyncVLA achieves state-of-the-art results across general embodied evaluations owing to its asynchronous generation with AFM. Our code is available at https://github.com/YuhuaJiang2002/AsyncVLA.
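To make the two central ideas of the abstract concrete, here is a minimal NumPy toy: flow-matching integration where each action token follows its own (non-uniform) time grid, followed by confidence-gated selective re-generation. Everything here is an illustrative stand-in, not the paper's implementation: the velocity field, confidence rater, targets, and schedules are hypothetical toys (in AsyncVLA these are learned networks); only the shape of the procedure, per-token time grids plus confidence-driven refinement, mirrors the text.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "ground-truth" actions for 3 action tokens in 2-D (illustration only).
TARGET = np.array([[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]])

def velocity_fn(x, t):
    """Toy velocity field pulling each token toward its target.
    Stands in for the learned VLA flow model, which this sketch does not have."""
    return TARGET - x

def afm_generate(x0, schedules):
    """Euler-integrate the flow with a per-token (asynchronous) time grid.

    schedules has shape (num_steps + 1, num_tokens): each column is one token's
    time grid, so different tokens may advance at different rates -- the
    non-uniform scheduling idea behind AFM."""
    x = x0.copy()
    for k in range(schedules.shape[0] - 1):
        t = schedules[k]
        dt = (schedules[k + 1] - t)[:, None]  # per-token step size
        x = x + velocity_fn(x, t) * dt
    return x

def confidence_fn(x):
    """Hypothetical confidence rater: here, closeness to the target.
    In the paper this is a learned module; this is only a stand-in."""
    return 1.0 / (1.0 + np.linalg.norm(x - TARGET, axis=1))

def generate_with_self_correction(num_tokens=3, dim=2, steps=8, threshold=0.6):
    """Generate actions, then selectively refine only low-confidence tokens."""
    x0 = rng.standard_normal((num_tokens, dim))
    # Non-uniform schedules: each token warps time with a different exponent.
    base = np.linspace(0.0, 1.0, steps + 1)
    schedules = np.stack([base ** (1.0 + 0.5 * i) for i in range(num_tokens)], axis=1)
    x = afm_generate(x0, schedules)
    low = confidence_fn(x) < threshold
    if low.any():
        # Selective self-correction: re-run the flow only for flagged tokens,
        # leaving confident tokens untouched.
        refined = afm_generate(x, schedules)
        x[low] = refined[low]
    return x
```

The key contrast with SFM is that `schedules` need not be the same uniform grid for every token, and the confidence gate re-generates only the tokens it flags rather than the whole action chunk.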