🤖 AI Summary
This work addresses key limitations of general-purpose robot policies: poor generalization to novel objects, unseen environments, and instructions involving abstract concepts, and limited adaptability to long-horizon dexterous manipulation (e.g., bi-manual coordination and mobile manipulation). To this end, we propose GR-3, a large-scale vision-language-action foundation model. Methodologically, GR-3 is trained with a multi-faceted recipe combining: (i) co-training with web-scale vision-language data, (ii) efficient fine-tuning on human trajectory data collected via VR devices, and (iii) imitation learning on real-robot trajectory data; it is deployed on ByteMini, a custom bi-manual mobile robot. Experiments demonstrate that GR-3 significantly outperforms the state-of-the-art policy $\pi_0$ across multiple challenging real-world tasks, exhibiting strong cross-scenario generalization, long-horizon planning, and robust dexterous manipulation. These results chart a scalable path toward general embodied intelligence for everyday assistance.
📝 Abstract
We report our recent progress towards building generalist robot policies: the development of GR-3. GR-3 is a large-scale vision-language-action (VLA) model. It showcases exceptional capabilities in generalizing to novel objects, environments, and instructions involving abstract concepts. Furthermore, it can be efficiently fine-tuned with minimal human trajectory data, enabling rapid and cost-effective adaptation to new settings. GR-3 also excels in long-horizon and dexterous tasks, including those requiring bi-manual manipulation and mobile movement, showcasing robust and reliable performance. These capabilities are achieved through a multi-faceted training recipe that includes co-training with web-scale vision-language data, efficient fine-tuning from human trajectory data collected via VR devices, and effective imitation learning with robot trajectory data. In addition, we introduce ByteMini, a versatile bi-manual mobile robot designed with exceptional flexibility and reliability, capable of accomplishing a wide range of tasks when integrated with GR-3. Through extensive real-world experiments, we show GR-3 surpasses the state-of-the-art baseline method, $\pi_0$, on a wide variety of challenging tasks. We hope GR-3 can serve as a step towards building generalist robots capable of assisting humans in daily life.