🤖 AI Summary
Robot manipulation faces a fundamental trade-off between high-frequency execution and high-level reasoning, and existing dual-system architectures fail to effectively leverage pre-trained knowledge from vision-language models (VLMs) in their fast-execution System 1. Method: This paper proposes a unified vision-language-action (VLA) foundation model, introducing the novel "Fast-in-Slow" paradigm, which deeply embeds a high-speed execution module (System 1) into a low-frequency, VLM-based reasoning framework (System 2). It achieves intra-model co-optimization of execution and reasoning via partial parameter sharing and asynchronous multimodal input modeling. A dual-aware co-training strategy enables end-to-end closed-loop control in both simulation and real-world settings. Contribution/Results: Experiments demonstrate average success rate improvements of 8% (simulation) and 11% (real-world) over prior state-of-the-art methods, with a control frequency of 117.7 Hz (chunk size = 8), substantially advancing performance and efficiency.
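The asynchronous dual-frequency control loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the System 2-to-System 1 frequency ratio, and the stub "models" are all hypothetical; only the chunk size of 8 and the slow-reasoning/fast-execution split come from the summary.

```python
from collections import deque

CHUNK_SIZE = 8    # actions produced per System 1 forward pass (from the paper)
SYS2_PERIOD = 24  # System 1 steps between System 2 updates (assumed ratio)

def system2_reason(instruction, image):
    """Slow VLM-based reasoning (stub): returns a latent intent."""
    return {"instruction": instruction, "plan": f"plan for {instruction}"}

def system1_act(intent, proprio):
    """Fast execution head (stub): returns a chunk of CHUNK_SIZE actions."""
    return [proprio + i * 0.01 for i in range(CHUNK_SIZE)]

def run_episode(instruction, steps=48):
    """Asynchronous loop: System 2 refreshes the intent every SYS2_PERIOD
    low-level ticks, while System 1 keeps emitting action chunks in between."""
    intent = None
    actions = deque()
    executed = []
    proprio = 0.0
    for t in range(steps):
        if t % SYS2_PERIOD == 0:        # low-frequency reasoning tick
            intent = system2_reason(instruction, image=None)
        if not actions:                 # refill the buffer with a fresh chunk
            actions.extend(system1_act(intent, proprio))
        a = actions.popleft()           # high-frequency execution tick
        proprio = a
        executed.append(a)
    return executed

acts = run_episode("pick up the cup")
```

The point of the sketch is the decoupling: the expensive reasoning call runs only once per `SYS2_PERIOD` ticks, while the cheap execution head keeps the control frequency high by amortizing each forward pass over a chunk of 8 actions.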
📝 Abstract
Policy generalization and execution efficiency are the two critical challenges in robotic manipulation. While recent foundation policies benefit from the common-sense reasoning capabilities of internet-scale pretrained vision-language models (VLMs), they often suffer from low execution frequency. To mitigate this dilemma, dual-system approaches, inspired by Kahneman's theory, have been proposed to leverage a VLM-based System 2 model that handles high-level reasoning and a separate System 1 action model that ensures real-time control. However, existing designs maintain both systems as separate models, limiting System 1 from fully leveraging the rich pretrained knowledge of the VLM-based System 2. In this work, we propose Fast-in-Slow (FiS), a unified dual-system vision-language-action (VLA) model that embeds the System 1 execution module within the VLM-based System 2 by partially sharing parameters. This paradigm not only enables high-frequency execution in System 1 but also facilitates coordination between the reasoning and execution components within a single foundation model. Given their fundamentally distinct roles within FiS-VLA, we design the two systems to incorporate heterogeneous modality inputs alongside asynchronous operating frequencies, enabling both fast and precise manipulation. To coordinate the two systems, we propose a dual-aware co-training strategy that equips System 1 with action-generation capabilities while preserving System 2's contextual reasoning representation. In evaluation, FiS-VLA outperforms previous state-of-the-art methods by 8% in simulation and 11% in real-world tasks in terms of average success rate, while achieving a 117.7 Hz control frequency with the action chunk size set to eight. Project web page: fast-in-slow.github.io.