🤖 AI Summary
To address task complexity and limited generalization in real-world robotic manipulation, this paper introduces the Galaxea Open-World Dataset, a large-scale, diverse robot behavior dataset recorded in real human living and working environments with a single robot embodiment and subtask-level language annotations. Building on it, the paper proposes G0, a dual-system framework: a vision-language model (VLM) handles multimodal task planning and reasoning, and a vision-language-action (VLA) model handles fine-grained action execution. G0 is trained with a three-stage curriculum: cross-embodiment pre-training, then single-embodiment pre-training, then task-specific post-training. Evaluation on a benchmark covering tabletop manipulation, few-shot learning, and long-horizon mobile manipulation shows that the approach is effective, and that the single-embodiment pre-training stage, combined with the Galaxea data, plays a critical role in the strong performance.
📝 Abstract
We present the Galaxea Open-World Dataset, a large-scale, diverse collection of robot behaviors recorded in authentic human living and working environments. All demonstrations are gathered using a consistent robotic embodiment, paired with precise subtask-level language annotations to facilitate both training and evaluation. Building on this dataset, we introduce G0, a dual-system framework that couples a Vision-Language Model (VLM) for multimodal planning with a Vision-Language-Action (VLA) model for fine-grained execution. G0 is trained using a three-stage curriculum: cross-embodiment pre-training, single-embodiment pre-training, and task-specific post-training. A comprehensive benchmark spanning tabletop manipulation, few-shot learning, and long-horizon mobile manipulation demonstrates the effectiveness of our approach. In particular, we find that the single-embodiment pre-training stage, together with the Galaxea Open-World Dataset, plays a critical role in achieving strong performance.
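
The structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `Stage`, `run_curriculum`, `run_dual_system`, the placeholder dataset labels, and the plan-once-then-execute loop are all hypothetical, chosen only to show how a three-stage curriculum and a planner/executor split might fit together. The `plan` and `act` callables stand in for the VLM and the VLA model respectively.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class Stage:
    """One curriculum stage. Stage names follow the abstract; data labels are placeholders."""
    name: str
    data: str


# Stage order as listed in the abstract. Only the second data source is named there.
CURRICULUM: List[Stage] = [
    Stage("cross-embodiment pre-training", "multi-robot data (placeholder)"),
    Stage("single-embodiment pre-training", "Galaxea Open-World Dataset"),
    Stage("task-specific post-training", "target-task demonstrations (placeholder)"),
]


def run_curriculum(train_stage: Callable[[Any, Stage], Any], params: Any) -> Any:
    """Apply a user-supplied training function to each stage in order,
    passing the parameters produced by one stage into the next."""
    for stage in CURRICULUM:
        params = train_stage(params, stage)
    return params


def run_dual_system(
    plan: Callable[[str, Any], List[str]],  # stand-in for the VLM planner
    act: Callable[[Any, str], Any],         # stand-in for the VLA executor
    task: str,
    observe: Callable[[], Any],
    max_steps_per_subtask: int = 3,
) -> List[Any]:
    """Assumed control flow: the planner splits the task into subtask
    instructions once, and the executor maps (observation, instruction)
    to an action for a fixed number of steps per subtask."""
    actions: List[Any] = []
    for subtask in plan(task, observe()):
        for _ in range(max_steps_per_subtask):
            actions.append(act(observe(), subtask))
    return actions
```

A real system would likely replan as observations change and run the two models at different rates; the fixed step count here is only a simplification to keep the sketch short.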