Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation

📅 2024-10-10
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To let robots combine broad adaptability with efficient, precise manipulation in diverse, dynamic environments, this paper proposes RoboDual, a dual-system architecture that pairs a generalist policy with a specialist policy. A vision-language-action (VLA) generalist provides high-level task understanding and strong generalization, while a lightweight diffusion Transformer specialist (only 20M trainable parameters) is conditioned on the generalist's outputs and handles real-time, data-efficient execution through multi-step receding-horizon action prediction. Compared to OpenVLA, RoboDual improves task success rate by 26.7% in real-world settings and by 12% on the CALVIN benchmark; it maintains strong performance with only 5% of the demonstration data and reaches a 3.8× higher control frequency in real-world deployment. The core contribution is the integration of general-purpose representation learning with lightweight, task-specialized execution, which eases the trade-off between generalization capability and real-time responsiveness.
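
The multi-step, receding-horizon action prediction mentioned in the summary can be illustrated with a toy sketch. This is not the authors' code: `predict_chunk`, `step_env`, the 1-D toy dynamics, and all constants are hypothetical stand-ins. The idea shown is only the generic pattern: the policy predicts several future actions per call, the controller executes the first few, then replans from the newest observation.

```python
from typing import Callable, List, Tuple


def receding_horizon_rollout(
    predict_chunk: Callable[[float], List[float]],
    step_env: Callable[[float, float], float],
    obs: float,
    total_steps: int,
    execute_per_plan: int,
) -> Tuple[float, List[float], int]:
    """Execute `total_steps` actions, replanning after every `execute_per_plan` steps."""
    if execute_per_plan < 1:
        raise ValueError("execute_per_plan must be at least 1")
    executed: List[float] = []
    num_plans = 0
    while len(executed) < total_steps:
        chunk = predict_chunk(obs)  # multi-step prediction from the latest observation
        num_plans += 1
        # Only the first few predicted actions are used; the rest are discarded.
        for action in chunk[:execute_per_plan]:
            if len(executed) == total_steps:
                break
            obs = step_env(obs, action)
            executed.append(action)
    return obs, executed, num_plans


# Hypothetical toy problem: move a 1-D point toward TARGET.
TARGET = 1.0
HORIZON = 4  # actions predicted per policy call (assumed value)


def toy_predict_chunk(x: float) -> List[float]:
    """Predict HORIZON actions, each closing half of the remaining distance."""
    chunk = []
    for _ in range(HORIZON):
        a = 0.5 * (TARGET - x)
        chunk.append(a)
        x += a  # roll the internal prediction forward
    return chunk


def toy_step(x: float, a: float) -> float:
    return x + a


final_x, actions, num_plans = receding_horizon_rollout(
    toy_predict_chunk, toy_step, obs=0.0, total_steps=6, execute_per_plan=2
)
```

Here six actions are executed with three policy calls, since each call contributes two executed actions out of the four it predicts.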

📝 Abstract
The increasing demand for versatile robotic systems to operate in diverse and dynamic environments has emphasized the importance of a generalist policy, which leverages a large cross-embodiment data corpus to facilitate broad adaptability and high-level reasoning. However, the generalist suffers from inefficient inference and costly training. The specialist policy, in contrast, is tailored to domain-specific data and excels at task-level precision with efficiency. Yet, it lacks the generalization capacity for a wide range of applications. Inspired by these observations, we introduce RoboDual, a synergistic dual-system that combines the merits of both generalist and specialist policies. A diffusion transformer-based specialist is devised for multi-step action rollouts, conditioned on the high-level task understanding and discretized action output of a vision-language-action (VLA) based generalist. Compared to OpenVLA, RoboDual achieves a 26.7% improvement in real-world settings and a 12% gain on CALVIN by introducing a specialist policy with merely 20M trainable parameters. It maintains strong performance with only 5% of the demonstration data, and enables a 3.8 times higher control frequency in real-world deployment. Code will be made publicly available. Our project page is hosted at: https://opendrivelab.com/RoboDual/
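
One way to picture how a slow generalist and a fast specialist could share a control loop is the sketch below. It is an assumption-laden illustration, not the RoboDual implementation: the scheduling rule (`generalist_period`), the `Conditioning` fields, and every function name are hypothetical. The only point shown is that the specialist can act at every control step while reusing a cached generalist output that is refreshed less often.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Conditioning:
    """Hypothetical stand-in for what the generalist hands to the specialist."""
    goal: float         # placeholder for high-level task understanding
    computed_at: float  # observation seen when the generalist last ran


def run_dual_system(
    generalist: Callable[[float], Conditioning],
    specialist: Callable[[float, Conditioning], Tuple[float, float]],
    read_obs: Callable[[int], float],
    send_action: Callable[[Tuple[float, float]], None],
    num_steps: int,
    generalist_period: int,
) -> int:
    """Run the specialist every step and the generalist every `generalist_period` steps."""
    if generalist_period < 1:
        raise ValueError("generalist_period must be at least 1")
    cached: Optional[Conditioning] = None
    generalist_calls = 0
    for t in range(num_steps):
        obs = read_obs(t)
        if t % generalist_period == 0:
            cached = generalist(obs)  # expensive call, made rarely
            generalist_calls += 1
        # Cheap call, made every step, conditioned on the cached output.
        send_action(specialist(obs, cached))
    return generalist_calls


sent: List[Tuple[float, float]] = []


def toy_specialist(obs: float, cond: Conditioning) -> Tuple[float, float]:
    action = 0.1 * (cond.goal - obs)      # toy action toward the goal
    staleness = obs - cond.computed_at    # how old the conditioning is
    return action, staleness


generalist_calls = run_dual_system(
    generalist=lambda obs: Conditioning(goal=10.0, computed_at=obs),
    specialist=toy_specialist,
    read_obs=lambda t: float(t),  # toy observation: just the step index
    send_action=sent.append,
    num_steps=6,
    generalist_period=3,
)
ages = [staleness for _, staleness in sent]
```

In this toy run the specialist produces six actions while the generalist is queried twice, and `ages` records how stale the cached conditioning was at each step. A real system would also need to decide how stale the conditioning may become before control quality degrades.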
Problem

Research questions and friction points this paper is trying to address.

Develop a dual-system for robotic manipulation
Enhance adaptability and efficiency in robotics
Combine generalist and specialist policies effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synergistic dual-system RoboDual
Diffusion transformer specialist
Vision-language-action generalist
Qingwen Bu
HKU | OpenDriveLab
Robot Learning, Computer Vision, Machine Learning
Hongyang Li
The University of Hong Kong
Li Chen
The University of Hong Kong
Jisong Cai
Shanghai AI Lab
Jia Zeng
Shanghai AI Lab
Heming Cui
University of Hong Kong
Operating Systems, Programming Language, Distributed Systems, Security
Maoqing Yao
Google
Yu Qiao
Shanghai AI Lab