Asynchronous Fast-Slow Vision-Language-Action Policies for Whole-Body Robotic Manipulation

πŸ“… 2025-12-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Current Vision-Language-Action (VLA) systems enforce synchronous execution between vision-language models (VLMs) and action modules, severely limiting real-time performance and control stability for whole-body robotic manipulation due to large-model inference latency. To address this, we propose DuoCore-FSβ€”the first asynchronous dual-path VLA framework featuring fast and slow processing streams. It introduces a latent-variable buffer and a whole-body action tokenizer to enable end-to-end decoupled training of semantic understanding (using a 3B-parameter VLM) and high-frequency action generation (at 30 Hz). This architecture allows low-frequency semantic updates from the VLM to coexist with high-frequency responses from a dedicated action expert, effectively increasing control bandwidth by 3Γ—. In real-robot experiments, DuoCore-FS achieves significantly higher task success rates and faster response times compared to synchronous baselines.

πŸ“ Abstract
Most Vision-Language-Action (VLA) systems integrate a Vision-Language Model (VLM) for semantic reasoning with an action expert generating continuous action signals, yet both typically run at a single unified frequency. As a result, policy performance is constrained by the low inference speed of large VLMs. This mandatory synchronous execution severely limits control stability and real-time performance in whole-body robotic manipulation, which involves more joints, larger motion spaces, and dynamically changing views. We introduce a truly asynchronous Fast-Slow VLA framework (DuoCore-FS), organizing the system into a fast pathway for high-frequency action generation and a slow pathway for rich VLM reasoning. The system is characterized by two key features. First, a latent representation buffer bridges the slow and fast systems. It stores instruction semantics and action-reasoning representation aligned with the scene-instruction context, providing high-level guidance to the fast pathway. Second, a whole-body action tokenizer provides a compact, unified representation of whole-body actions. Importantly, the VLM and action expert are still jointly trained end-to-end, preserving unified policy learning while enabling asynchronous execution. DuoCore-FS supports a 3B-parameter VLM while achieving 30 Hz whole-body action-chunk generation, approximately three times as fast as prior VLA models with comparable model sizes. Real-world whole-body manipulation experiments demonstrate improved task success rates and significantly enhanced responsiveness compared to synchronous Fast-Slow VLA baselines. The implementation of DuoCore-FS, including training, inference, and deployment, is provided to commercial users by Astribot as part of the Astribot robotic platform.
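The core idea in the abstract, a slow VLM pathway and a fast action pathway decoupled by a latent representation buffer, can be sketched as a minimal two-thread loop. This is an illustrative sketch only: the `LatentBuffer`, the simulated latencies, and the dictionary-valued latent are assumptions, not the paper's implementation, which stores learned instruction semantics and action-reasoning representations.

```python
import threading
import time

class LatentBuffer:
    """Thread-safe single-slot buffer bridging the slow VLM pathway and the
    fast action pathway (a sketch; the paper's buffer holds learned latents)."""
    def __init__(self, init_latent):
        self._lock = threading.Lock()
        self._latent = init_latent

    def write(self, latent):
        # Slow pathway: low-frequency semantic updates.
        with self._lock:
            self._latent = latent

    def read(self):
        # Fast pathway: non-blocking read of the most recent latent.
        with self._lock:
            return self._latent

def slow_vlm_loop(buf, steps, period_s):
    # Stand-in for VLM inference producing a semantic latent at low frequency.
    for k in range(steps):
        time.sleep(period_s)  # simulated large-model inference latency
        buf.write({"step": k, "semantics": f"latent-{k}"})

def fast_action_loop(buf, steps, period_s):
    # Stand-in for the high-frequency action expert: each tick is conditioned
    # on the latest available latent and never waits on the slow pathway.
    trace = []
    for _ in range(steps):
        latent = buf.read()
        trace.append(latent["step"])
        time.sleep(period_s)
    return trace

buf = LatentBuffer({"step": -1, "semantics": "init"})
slow = threading.Thread(target=slow_vlm_loop, args=(buf, 3, 0.1))
slow.start()
trace = fast_action_loop(buf, 12, 1 / 30)  # ~30 Hz fast loop
slow.join()
print(trace)  # fast ticks reuse the most recent slow latent between updates
```

The key property the sketch demonstrates is asynchrony: the fast loop runs many ticks per slow update, reusing stale-but-recent semantics instead of blocking on VLM inference.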
Problem

Research questions and friction points this paper is trying to address.

Synchronous execution between VLMs and action modules limits real-time performance
Large-model inference latency constrains control stability and bandwidth
Whole-body manipulation involves more joints, larger motion spaces, and changing views
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asynchronous fast-slow pathways for high-frequency action generation
Latent representation buffer bridges slow semantic reasoning and fast execution
Whole-body action tokenizer provides compact unified representation for manipulation
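The whole-body action tokenizer above can be illustrated with a simple uniform-quantization stand-in: continuous joint commands are binned into discrete tokens and decoded back to bin centers. The bin count, the normalized joint range, and uniform binning itself are illustrative assumptions; the paper's tokenizer is a learned, compact representation.

```python
import numpy as np

# Illustrative stand-in for a whole-body action tokenizer: each joint's
# continuous command is uniformly binned into one of N_BINS discrete tokens.
N_BINS = 256
LOW, HIGH = -1.0, 1.0  # assumed normalized joint range

def tokenize(actions):
    """Map continuous whole-body actions of shape [T, J] to integer tokens."""
    clipped = np.clip(actions, LOW, HIGH)
    scale = (clipped - LOW) / (HIGH - LOW)  # map to [0, 1]
    return np.minimum((scale * N_BINS).astype(np.int64), N_BINS - 1)

def detokenize(tokens):
    """Map tokens back to continuous actions at bin centers."""
    return LOW + (tokens + 0.5) / N_BINS * (HIGH - LOW)

# Example: an 8-step action chunk over 25 whole-body joints.
actions = np.random.uniform(-1, 1, size=(8, 25))
tokens = tokenize(actions)
recon = detokenize(tokens)
print(np.abs(recon - actions).max())  # bounded by half a bin width
```

Round-trip error is at most half a bin width ((HIGH − LOW) / N_BINS / 2 ≈ 0.0039 here), which is the usual trade-off between token vocabulary size and action resolution.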
Teqiang Zou (Astribot)
Hongliang Zeng (Astribot)
Yuxuan Nong (Astribot)
Yifan Li (Astribot)
Kehui Liu (Astribot)
Haotian Yang (Kuaishou Technology; computer vision, computer graphics)
Xinyang Ling (Astribot)
Xin Li (Astribot)
Lianyang Ma (Astribot)