Abstract
General-purpose robots need a deep understanding of the physical world, advanced reasoning, and general and dexterous control. This report introduces the latest generation of the Gemini Robotics model family: Gemini Robotics 1.5, a multi-embodiment Vision-Language-Action (VLA) model, and Gemini Robotics-ER 1.5, a state-of-the-art Embodied Reasoning (ER) model. It brings together three major innovations. First, Gemini Robotics 1.5 features a novel architecture and a Motion Transfer (MT) mechanism, which enables it to learn from heterogeneous, multi-embodiment robot data and makes the VLA more general. Second, Gemini Robotics 1.5 interleaves actions with a multi-level internal reasoning process in natural language. This enables the robot to "think before acting", notably improves its ability to decompose and execute complex, multi-step tasks, and makes the robot's behavior more interpretable to the user. Third, Gemini Robotics-ER 1.5 establishes a new state of the art for embodied reasoning, i.e., for reasoning capabilities that are critical for robots, such as visual and spatial understanding, task planning, and progress estimation. Together, this family of models takes us a step towards an era of physical agents, enabling robots to perceive, think, and then act so they can solve complex multi-step tasks.
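The interleaving of natural-language reasoning with low-level actions described above can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in, not the actual Gemini Robotics 1.5 API: `plan_subtasks` mimics the "think" phase that decomposes a long-horizon instruction into natural-language subtasks, and `act` mimics the "act" phase that executes each subtask, so the resulting trace is both the control loop and an interpretable record of behavior.

```python
def plan_subtasks(instruction: str) -> list[str]:
    """Stand-in for the model's internal reasoning: decompose a
    long-horizon instruction into natural-language subtasks before
    any action is emitted (toy lookup, not a real planner)."""
    plans = {
        "sort the laundry by color": [
            "pick up a garment",
            "check its color",
            "place it in the matching bin",
        ],
    }
    # Fall back to treating the instruction as a single subtask.
    return plans.get(instruction, [instruction])


def act(subtask: str, state: dict) -> dict:
    """Stand-in for low-level action generation: record the subtask
    as executed, yielding an interpretable trace of behavior."""
    state["log"].append(subtask)
    return state


def think_then_act(instruction: str) -> list[str]:
    """Interleave reasoning and acting: plan first, then execute
    each subtask in order, returning the behavior trace."""
    state = {"log": []}
    for subtask in plan_subtasks(instruction):  # "think" phase
        state = act(subtask, state)             # "act" phase
    return state["log"]


trace = think_then_act("sort the laundry by color")
print(trace)
```

The key design point the sketch captures is that the plan exists as explicit natural language before execution begins, which is what makes multi-step decomposition and user-facing interpretability possible.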