Galaxea Open-World Dataset and G0 Dual-System VLA Model

📅 2025-08-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address task complexity and limited generalization in real-world robotic manipulation, this paper introduces the Galaxea Open-World Dataset, a large-scale, highly diverse collection of robot behaviors recorded in real human living and working environments with a single, consistent robot embodiment. It also proposes G0, a dual-system framework in which a vision-language model (VLM) handles task planning and reasoning and a vision-language-action (VLA) model handles fine-grained action execution. G0 is trained with a three-stage curriculum: cross-embodiment pre-training, single-embodiment pre-training, and task-specific post-training. The single-embodiment pre-training stage, which draws on the Galaxea data, is reported to substantially improve long-horizon task performance and few-shot generalization. Evaluation on a benchmark covering tabletop manipulation, few-shot learning, and long-horizon mobile manipulation supports both the dataset's usefulness and the framework's effectiveness.

📝 Abstract
We present the Galaxea Open-World Dataset, a large-scale, diverse collection of robot behaviors recorded in authentic human living and working environments. All demonstrations are gathered using a consistent robotic embodiment, paired with precise subtask-level language annotations to facilitate both training and evaluation. Building on this dataset, we introduce G0, a dual-system framework that couples a Vision-Language Model (VLM) for multimodal planning with a Vision-Language-Action (VLA) model for fine-grained execution. G0 is trained using a three-stage curriculum: cross-embodiment pre-training, single-embodiment pre-training, and task-specific post-training. A comprehensive benchmark spanning tabletop manipulation, few-shot learning, and long-horizon mobile manipulation demonstrates the effectiveness of our approach. In particular, we find that the single-embodiment pre-training stage, together with the Galaxea Open-World Dataset, plays a critical role in achieving strong performance.
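
The planner/executor split described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the class name `DualSystemAgent`, the callables `plan_subtask` and `predict_actions`, the `replan_every` interval, and the use of plain strings as observations are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Action = List[float]  # a single low-level command, e.g. joint targets (assumed format)


@dataclass
class DualSystemAgent:
    """Toy dual-system loop: a slow planner feeds subtasks to a fast executor."""

    # Hypothetical stand-in for the VLM: (task instruction, observation) -> subtask instruction
    plan_subtask: Callable[[str, str], str]
    # Hypothetical stand-in for the VLA: (subtask instruction, observation) -> chunk of actions
    predict_actions: Callable[[str, str], List[Action]]
    # Assumed design choice: the planner runs less often than the executor
    replan_every: int = 2

    def run(self, task: str, observations: Sequence[str]) -> List[Action]:
        executed: List[Action] = []
        subtask = ""
        for step, obs in enumerate(observations):
            if step % self.replan_every == 0:
                # Planner refreshes the current subtask from the latest observation
                subtask = self.plan_subtask(task, obs)
            # Executor turns the current subtask and observation into low-level actions
            executed.extend(self.predict_actions(subtask, obs))
        return executed


def toy_planner(task: str, obs: str) -> str:
    return f"{task}: handle {obs}"


def toy_policy(subtask: str, obs: str) -> List[Action]:
    return [[0.0, 0.0]]  # placeholder action, not a real control command


agent = DualSystemAgent(toy_planner, toy_policy, replan_every=2)
actions = agent.run("tidy the desk", ["cup", "pen", "book"])
```

In this sketch the planner is called on steps 0 and 2 only, while the executor runs on every step; how often G0 actually replans is not stated in this summary.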
Problem

Research questions and friction points this paper is trying to address.

Creating a large-scale dataset of robot behaviors in human environments
Developing a dual-system model for multimodal planning and execution
Achieving strong performance across tabletop manipulation, few-shot learning, and long-horizon mobile manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-system VLA model for planning and execution
Three-stage curriculum training methodology
Open-world dataset with precise language annotations
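
The three-stage curriculum can be pictured as a sequence of training stages that each start from the previous stage's parameters. The sketch below is only an illustration under stated assumptions: `Stage`, `run_curriculum`, and `toy_train` are hypothetical names, the data-source labels marked "(assumed)" are not from the paper, and the "training" step is a counter rather than a real optimizer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]  # stand-in for model weights


@dataclass(frozen=True)
class Stage:
    name: str
    data_source: str  # descriptive label only
    train: Callable[[Params, str], Params]


def run_curriculum(init: Params, stages: List[Stage]) -> Tuple[Params, List[str]]:
    """Run stages in order, passing each stage's output parameters to the next."""
    params = dict(init)
    completed: List[str] = []
    for stage in stages:
        params = stage.train(params, stage.data_source)
        completed.append(stage.name)
    return params, completed


def toy_train(params: Params, data_source: str) -> Params:
    # Placeholder for a real training loop: it only counts applied stages.
    updated = dict(params)
    updated["updates"] = updated.get("updates", 0.0) + 1.0
    return updated


STAGES = [
    Stage("cross-embodiment pre-training", "multi-robot data (assumed)", toy_train),
    Stage("single-embodiment pre-training", "Galaxea Open-World Dataset", toy_train),
    Stage("task-specific post-training", "target-task demonstrations (assumed)", toy_train),
]

final_params, completed = run_curriculum({"updates": 0.0}, STAGES)
```

The point of the structure is that the order matters: each stage inherits the weights of the one before it, so the single-embodiment stage refines a model that has already seen cross-embodiment data.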