🤖 AI Summary
To address the high cost of real-world robotic data and limited generalization in generic Vision-Language-Action (VLA) models, this paper proposes a world model-driven data synthesis framework. Our method jointly reasons over spatial geometry, object states, and long-horizon dependencies via RGB-D input modeling and embodied Chain-of-Thought supervision. Leveraging a learned world model, we generate synthetic videos, multi-view observations, and sim-to-real transfer samples to support both vision-language pretraining and dexterous manipulation policy learning. The approach substantially reduces reliance on real robot data while maintaining strong real-world performance under significant variations in appearance, scene layout, and viewpoint. We further introduce GigaBrain-0-Small, a lightweight VLA model optimized for efficient deployment on edge devices such as the Jetson AGX Orin. Experimental results demonstrate improved data efficiency, robust cross-domain generalization, and practical applicability in resource-constrained robotic systems.
📄 Abstract
Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect. The inefficiency of physical data collection severely limits the scalability and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, and sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. Our approach further improves policy robustness through RGB-D input modeling and embodied Chain-of-Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long-horizon dependencies during task execution. This leads to substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves superior generalization across variations in appearance (e.g., textures, colors), object placement, and camera viewpoint. Additionally, we present GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently on edge devices such as the NVIDIA Jetson AGX Orin.