Asynchronous Fast-Slow Vision-Language-Action Policies for Whole-Body Robotic Manipulation

πŸ“… 2025-12-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Current Vision-Language-Action (VLA) systems enforce synchronous execution between vision-language models (VLMs) and action modules, severely limiting real-time performance and control stability for whole-body robotic manipulation due to large-model inference latency. To address this, we propose DuoCore-FSβ€”the first asynchronous dual-path VLA framework featuring fast and slow processing streams. It introduces a latent-variable buffer and a whole-body action tokenizer to enable end-to-end decoupled training of semantic understanding (using a 3B-parameter VLM) and high-frequency action generation (at 30 Hz). This architecture allows low-frequency semantic updates from the VLM to coexist with high-frequency responses from a dedicated action expert, effectively increasing control bandwidth by 3Γ—. In real-robot experiments, DuoCore-FS achieves significantly higher task success rates and faster response times compared to synchronous baselines.

πŸ“ Abstract
Most Vision-Language-Action (VLA) systems integrate a Vision-Language Model (VLM) for semantic reasoning with an action expert generating continuous action signals, yet both typically run at a single unified frequency. As a result, policy performance is constrained by the low inference speed of large VLMs. This mandatory synchronous execution severely limits control stability and real-time performance in whole-body robotic manipulation, which involves more joints, larger motion spaces, and dynamically changing views. We introduce a truly asynchronous Fast-Slow VLA framework (DuoCore-FS), organizing the system into a fast pathway for high-frequency action generation and a slow pathway for rich VLM reasoning. The system is characterized by two key features. First, a latent representation buffer bridges the slow and fast systems. It stores instruction semantics and action-reasoning representation aligned with the scene-instruction context, providing high-level guidance to the fast pathway. Second, a whole-body action tokenizer provides a compact, unified representation of whole-body actions. Importantly, the VLM and action expert are still jointly trained end-to-end, preserving unified policy learning while enabling asynchronous execution. DuoCore-FS supports a 3B-parameter VLM while achieving 30 Hz whole-body action-chunk generation, approximately three times as fast as prior VLA models with comparable model sizes. Real-world whole-body manipulation experiments demonstrate improved task success rates and significantly enhanced responsiveness compared to synchronous Fast-Slow VLA baselines. The implementation of DuoCore-FS, including training, inference, and deployment, is provided to commercial users by Astribot as part of the Astribot robotic platform.
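The core idea in the abstract, a slow VLM pathway and a fast action pathway decoupled by a latent representation buffer, can be sketched as a minimal two-thread loop. This is an illustrative sketch only: the `LatentBuffer`, the simulated latencies, and the dictionary-valued latent are assumptions, not the paper's implementation, which stores learned instruction semantics and action-reasoning representations.

```python
import threading
import time

class LatentBuffer:
    """Thread-safe single-slot buffer bridging the slow VLM pathway and the
    fast action pathway (a sketch; the paper's buffer holds learned latents)."""
    def __init__(self, init_latent):
        self._lock = threading.Lock()
        self._latent = init_latent

    def write(self, latent):
        # Slow pathway: low-frequency semantic updates.
        with self._lock:
            self._latent = latent

    def read(self):
        # Fast pathway: non-blocking read of the most recent latent.
        with self._lock:
            return self._latent

def slow_vlm_loop(buf, steps, period_s):
    # Stand-in for VLM inference producing a semantic latent at low frequency.
    for k in range(steps):
        time.sleep(period_s)  # simulated large-model inference latency
        buf.write({"step": k, "semantics": f"latent-{k}"})

def fast_action_loop(buf, steps, period_s):
    # Stand-in for the high-frequency action expert: each tick is conditioned
    # on the latest available latent and never waits on the slow pathway.
    trace = []
    for _ in range(steps):
        latent = buf.read()
        trace.append(latent["step"])
        time.sleep(period_s)
    return trace

buf = LatentBuffer({"step": -1, "semantics": "init"})
slow = threading.Thread(target=slow_vlm_loop, args=(buf, 3, 0.1))
slow.start()
trace = fast_action_loop(buf, 12, 1 / 30)  # ~30 Hz fast loop
slow.join()
print(trace)  # fast ticks reuse the most recent slow latent between updates
```

The key property the sketch demonstrates is asynchrony: the fast loop runs many ticks per slow update, reusing stale-but-recent semantics instead of blocking on VLM inference.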
Problem

Research questions and friction points this paper is trying to address.

Synchronous execution between VLMs and action modules limits real-time performance
Large-model inference latency constrains control stability and bandwidth
Whole-body manipulation involves more joints, larger motion spaces, and changing views
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asynchronous fast-slow pathways for high-frequency action generation
Latent representation buffer bridges slow semantic reasoning and fast execution
Whole-body action tokenizer provides compact unified representation for manipulation
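The whole-body action tokenizer above can be illustrated with a simple uniform-quantization stand-in: continuous joint commands are binned into discrete tokens and decoded back to bin centers. The bin count, the normalized joint range, and uniform binning itself are illustrative assumptions; the paper's tokenizer is a learned, compact representation.

```python
import numpy as np

# Illustrative stand-in for a whole-body action tokenizer: each joint's
# continuous command is uniformly binned into one of N_BINS discrete tokens.
N_BINS = 256
LOW, HIGH = -1.0, 1.0  # assumed normalized joint range

def tokenize(actions):
    """Map continuous whole-body actions of shape [T, J] to integer tokens."""
    clipped = np.clip(actions, LOW, HIGH)
    scale = (clipped - LOW) / (HIGH - LOW)  # map to [0, 1]
    return np.minimum((scale * N_BINS).astype(np.int64), N_BINS - 1)

def detokenize(tokens):
    """Map tokens back to continuous actions at bin centers."""
    return LOW + (tokens + 0.5) / N_BINS * (HIGH - LOW)

# Example: an 8-step action chunk over 25 whole-body joints.
actions = np.random.uniform(-1, 1, size=(8, 25))
tokens = tokenize(actions)
recon = detokenize(tokens)
print(np.abs(recon - actions).max())  # bounded by half a bin width
```

Round-trip error is at most half a bin width ((HIGH − LOW) / N_BINS / 2 ≈ 0.0039 here), which is the usual trade-off between token vocabulary size and action resolution.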
Teqiang Zou (Astribot)
Hongliang Zeng (Astribot)
Yuxuan Nong (Astribot)
Yifan Li (Astribot)
Kehui Liu (Astribot)
Haotian Yang (Kuaishou Technology; computer vision, computer graphics)
Xinyang Ling (Astribot)
Xin Li (Astribot)
Lianyang Ma (Astribot)