Galaxea Open-World Dataset and G0 Dual-System VLA Model

📅 2025-08-30

🤖 AI Summary

To address task complexity and limited generalization in real-world robotic manipulation, this paper introduces the Galaxea Open-World Dataset, a large-scale, diverse collection of robot behaviors recorded in real human living and working environments with a single, consistent robot embodiment. It also proposes G0, a dual-system framework in which a vision-language model (VLM) handles task planning and reasoning while a vision-language-action (VLA) model handles fine-grained action execution. G0 is trained with a three-stage curriculum: cross-embodiment pre-training, then single-embodiment pre-training, then task-specific post-training. Evaluation on a benchmark covering tabletop manipulation, few-shot learning, and long-horizon mobile manipulation supports the approach. The main finding is that single-embodiment pre-training on the Galaxea data substantially improves long-horizon task performance and few-shot generalization.

📝 Abstract

We present the Galaxea Open-World Dataset, a large-scale, diverse collection of robot behaviors recorded in authentic human living and working environments. All demonstrations are gathered using a consistent robotic embodiment, paired with precise subtask-level language annotations to facilitate both training and evaluation. Building on this dataset, we introduce G0, a dual-system framework that couples a Vision-Language Model (VLM) for multimodal planning with a Vision-Language-Action (VLA) model for fine-grained execution. G0 is trained using a three-stage curriculum: cross-embodiment pre-training, single-embodiment pre-training, and task-specific post-training. A comprehensive benchmark spanning tabletop manipulation, few-shot learning, and long-horizon mobile manipulation demonstrates the effectiveness of our approach. In particular, we find that the single-embodiment pre-training stage, together with the Galaxea Open-World Dataset, plays a critical role in achieving strong performance.
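
The dual-system split described in the abstract can be pictured as a slow planner feeding subtask instructions to a fast controller. The sketch below is only an illustration under stated assumptions: the names `run_dual_system`, `Planner`, `Policy`, and the `replan_every` interval are hypothetical and do not come from the paper, which does not specify how often the VLM is queried or what interface connects the two models.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical interfaces for illustration only; these are not G0's actual API.
Planner = Callable[[str, dict], str]         # (instruction, observation) -> subtask text
Policy = Callable[[str, dict], List[float]]  # (subtask, observation) -> action vector


def run_dual_system(
    instruction: str,
    planner: Planner,
    policy: Policy,
    observations: Iterable[dict],
    replan_every: int = 10,  # assumed interval, not taken from the paper
) -> List[Tuple[int, str, List[float]]]:
    """Query the slow planner every `replan_every` steps and the fast policy every step."""
    if replan_every < 1:
        raise ValueError("replan_every must be at least 1")
    trace = []
    subtask = ""
    for step, obs in enumerate(observations):
        # Slow system (VLM role): refresh the current subtask instruction.
        if step % replan_every == 0:
            subtask = planner(instruction, obs)
        # Fast system (VLA role): map the subtask and observation to an action.
        action = policy(subtask, obs)
        trace.append((step, subtask, action))
    return trace
```

In this toy loop the planner sees only the high-level instruction and the latest observation, while the policy never sees the original instruction, only the subtask text. That mirrors the role of subtask-level language annotations as the interface between the two levels, though the real system may share more context.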

Problem

Research questions and friction points this paper is trying to address.

Creating a large-scale dataset of robot behaviors in human environments

Developing a dual-system model for multimodal planning and execution

Achieving strong performance on tabletop manipulation, few-shot learning, and long-horizon mobile manipulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-system VLA model for planning and execution

Three-stage curriculum training methodology

Open-world dataset with precise language annotations
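
The three-stage curriculum listed above amounts to training one set of weights through three data regimes in a fixed order. The following is a minimal sketch of that ordering, not the authors' training code: `Stage`, `CURRICULUM`, `run_curriculum`, and the `train_stage` callback are hypothetical names, and the data descriptions are a reading of the abstract rather than confirmed details.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    data: str  # informal description of the data source (assumed, see lead-in)


# Stage order follows the abstract; the data descriptions are assumptions.
CURRICULUM: Tuple[Stage, ...] = (
    Stage("cross_embodiment_pretraining", "demonstrations from multiple robot embodiments"),
    Stage("single_embodiment_pretraining", "Galaxea Open-World Dataset (one embodiment)"),
    Stage("task_specific_post_training", "demonstrations for the target task"),
)


def run_curriculum(
    init_weights: Any,
    train_stage: Callable[[Any, Stage], Any],
    stages: Sequence[Stage] = CURRICULUM,
) -> Tuple[Any, List[str]]:
    """Train sequentially, starting each stage from the previous stage's weights."""
    weights = init_weights
    completed: List[str] = []
    for stage in stages:
        weights = train_stage(weights, stage)
        completed.append(stage.name)
    return weights, completed
```

Passing a shortened `stages` sequence, for example one without the single-embodiment stage, gives a simple way to set up the kind of ablation that would isolate that stage's contribution.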
