🤖 AI Summary
This work addresses key limitations of general-purpose robot policies: poor generalization to novel objects, unseen environments, and instructions involving abstract concepts, and limited adaptability to long-horizon dexterous manipulation (e.g., bi-manual coordination and mobile manipulation). To this end, we propose GR-3, a large-scale vision-language-action foundation model. Methodologically, GR-3 is trained with a multi-faceted recipe combining: (i) co-training with web-scale vision-language data, (ii) efficient fine-tuning on human trajectory data collected via VR devices, and (iii) imitation learning on real-robot trajectory data; it is deployed on ByteMini, a custom bi-manual mobile robot. Experiments demonstrate that GR-3 significantly outperforms the state-of-the-art policy $\pi_0$ across multiple challenging real-world tasks, exhibiting strong cross-scenario generalization, long-horizon planning, and robust dexterous manipulation. These results chart a scalable path toward general embodied intelligence for everyday assistance.
📝 Abstract
We report our recent progress towards building generalist robot policies: the development of GR-3. GR-3 is a large-scale vision-language-action (VLA) model. It showcases exceptional capabilities in generalizing to novel objects, environments, and instructions involving abstract concepts. Furthermore, it can be efficiently fine-tuned with minimal human trajectory data, enabling rapid and cost-effective adaptation to new settings. GR-3 also excels in long-horizon and dexterous tasks, including those requiring bi-manual manipulation and mobile movement, showcasing robust and reliable performance. These capabilities are achieved through a multi-faceted training recipe that includes co-training with web-scale vision-language data, efficient fine-tuning from human trajectory data collected via VR devices, and effective imitation learning with robot trajectory data. In addition, we introduce ByteMini, a versatile bi-manual mobile robot designed with exceptional flexibility and reliability, capable of accomplishing a wide range of tasks when integrated with GR-3. Through extensive real-world experiments, we show GR-3 surpasses the state-of-the-art baseline method, $\pi_0$, on a wide variety of challenging tasks. We hope GR-3 can serve as a step towards building generalist robots capable of assisting humans in daily life.