Fast-in-Slow: A Dual-System Foundation Model Unifying Fast Manipulation within Slow Reasoning

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Robot manipulation faces a fundamental trade-off between high-frequency execution and high-level reasoning, and existing dual-system architectures fail to effectively leverage pre-trained knowledge from vision-language models (VLMs) in their fast-execution System 1. Method: This paper proposes Fast-in-Slow (FiS), a unified vision-language-action (VLA) foundation model that deeply embeds a high-speed execution module (System 1) into a low-frequency, VLM-based reasoning framework (System 2). It co-optimizes execution and reasoning within a single model via partial parameter sharing and asynchronous, heterogeneous multimodal inputs, while a dual-aware co-training strategy enables end-to-end closed-loop control in both simulation and real-world settings. Contribution/Results: Experiments demonstrate average success-rate improvements of 8% (simulation) and 11% (real-world) over prior state-of-the-art methods, at a control frequency of 117.7 Hz (action chunk size = 8), substantially advancing both performance and efficiency.

📝 Abstract
Generalized policy and execution efficiency constitute the two critical challenges in robotic manipulation. While recent foundation policies benefit from the common-sense reasoning capabilities of internet-scale pretrained vision-language models (VLMs), they often suffer from low execution frequency. To mitigate this dilemma, dual-system approaches, inspired by Kahneman's theory, have been proposed to leverage a VLM-based System 2 model handling high-level reasoning and a separate System 1 action model ensuring real-time control. However, existing designs maintain both systems as separate models, limiting System 1 from fully leveraging the rich pretrained knowledge from the VLM-based System 2. In this work, we propose Fast-in-Slow (FiS), a unified dual-system vision-language-action (VLA) model that embeds the System 1 execution module within the VLM-based System 2 by partially sharing parameters. This innovative paradigm not only enables high-frequency execution in System 1 but also facilitates coordination between the reasoning and execution components within a single foundation model of System 2. Given their fundamentally distinct roles within FiS-VLA, we design the two systems to incorporate heterogeneous modality inputs alongside asynchronous operating frequencies, enabling both fast and precise manipulation. To enable coordination between the two systems, a dual-aware co-training strategy is proposed that equips System 1 with action generation capabilities while preserving System 2's contextual reasoning representation. For evaluation, FiS-VLA outperforms previous state-of-the-art methods by 8% in simulation and 11% in real-world tasks in terms of average success rate, while achieving a 117.7 Hz control frequency with action chunk set to eight. Project web page: fast-in-slow.github.io.
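The abstract describes the core mechanism: a slow VLM-based System 2 re-plans at low frequency, while the embedded System 1 head executes at high frequency by emitting action chunks (size eight in the paper's evaluation). A minimal sketch of such an asynchronous, chunked dual-frequency control loop is below. The function names, stub models, and the System-2 update period are all hypothetical illustrations, not the paper's implementation:

```python
from collections import deque

CHUNK_SIZE = 8     # actions per System-1 inference, matching the paper's chunk size
SYS2_PERIOD = 10   # System-1 chunks between System-2 (VLM) re-plans; illustrative value

def system2_reason(instruction, image):
    """Slow VLM reasoning (hypothetical stub): returns a latent plan vector."""
    return [hash((instruction, image)) % 100 / 100.0]

def system1_act(latent, proprio):
    """Fast execution head (hypothetical stub): returns one chunk of actions."""
    return [latent[0] + 0.01 * i + proprio for i in range(CHUNK_SIZE)]

def control_loop(instruction, n_chunks=3):
    """Run the high-frequency loop, refreshing the slow latent only occasionally."""
    latent = system2_reason(instruction, image=0)  # initial slow pass
    pending = deque()
    executed = []
    for t in range(n_chunks * CHUNK_SIZE):
        # Infrequent System-2 re-plan (asynchronous with the fast loop).
        if t > 0 and t % (SYS2_PERIOD * CHUNK_SIZE) == 0:
            latent = system2_reason(instruction, image=t)
        # Fast System-1 inference, amortized over a chunk of actions.
        if not pending:
            pending.extend(system1_act(latent, proprio=0.0))
        executed.append(pending.popleft())
    return executed

acts = control_loop("pick up the cube")
print(len(acts))  # 24: three chunks of eight actions
```

The key point the sketch illustrates is that System 1 never blocks on the VLM: it keeps consuming the most recent latent, which is how a chunked fast loop can sustain rates like 117.7 Hz while the reasoning module runs far slower.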
Problem

Research questions and friction points this paper is trying to address.

Unifying fast manipulation with slow reasoning in robotics
Improving execution efficiency in dual-system robotic models
Enhancing coordination between reasoning and execution components
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified dual-system VLA model with shared parameters
Dual-aware co-training for action and reasoning coordination
Heterogeneous inputs and asynchronous frequencies for precision
Authors

Hao Chen, The Chinese University of Hong Kong
Jiaming Liu, State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Chenyang Gu, Undergraduate, Peking University (Embodied AI, Robotic Manipulation)
Zhuoyang Liu, Peking University (Embodied AI, Computer Vision)
Renrui Zhang, Seed ByteDance & MMLab & PKU (Large Multimodal Model, Generative Model, Embodied AI)
Xiaoqi Li, State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Xiao He, AI2Robotics
Yandong Guo, AI2Robotics
Chi-Wing Fu, The Chinese University of Hong Kong
Shanghang Zhang, Peking University (Embodied AI, Foundation Models)
P. Heng, The Chinese University of Hong Kong