🤖 AI Summary
To address the high cost of real-world robotic data and limited generalization in generic Vision-Language-Action (VLA) models, this paper proposes a world model-driven data synthesis framework. Our method jointly reasons over spatial geometry, object states, and long-horizon dependencies via RGB-D input modeling and embodied Chain-of-Thought supervision. Leveraging a learned world model, we generate synthetic videos, multi-view observations, and sim-to-real transfer samples to support both vision-language pretraining and dexterous manipulation policy learning. The approach substantially reduces reliance on real robot data while maintaining strong real-world performance under significant variations in appearance, scene layout, and viewpoint. We further introduce GigaBrain-0-Small, a lightweight VLA model optimized for efficient deployment on edge devices such as the Jetson AGX Orin. Experimental results demonstrate improved data efficiency, robust cross-domain generalization, and practical applicability in resource-constrained robotic systems.
📄 Abstract
Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect. The inefficiency of physical data collection severely limits the scalability and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, and sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. Our approach further improves policy robustness through RGB-D input modeling and embodied Chain-of-Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long-horizon dependencies during task execution. This leads to substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves superior generalization across variations in appearance (e.g., textures, colors), object placement, and camera viewpoint. Additionally, we present GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently on edge devices such as the NVIDIA Jetson AGX Orin.