🤖 AI Summary
This work addresses the limited capability of embodied intelligence on complex, fine-grained manipulation tasks, particularly in 3D spatial perception and temporal dynamics modeling, by proposing an embodied AI foundation model trained with large-scale spatiotemporal supervision. For the first time, the method moves from relative 2D pixel grounding to absolute 3D coordinate prediction with metric depth, integrating ordered keypoint sequence generation, absolute metric constraint understanding, dense temporal value estimation, and robust multi-view temporal modeling. These upgrades significantly improve spatial reasoning accuracy and execution stability: the resulting system generates physically plausible, complete manipulation trajectories and provides fine-grained, viewpoint-invariant execution feedback, supplying reliable learning signals for downstream tasks.
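The model's internal representation is not detailed here, but the 2D-to-3D relationship the summary describes rests on standard pinhole back-projection: given a pixel location and a metric depth, the absolute camera-frame coordinate follows from the camera intrinsics. Below is a minimal sketch of that geometry; the function and parameter names (`backproject_keypoints`, `fx`, `fy`, `cx`, `cy`) are illustrative assumptions, not the model's API.

```python
import numpy as np

def backproject_keypoints(pixels_uv: np.ndarray,
                          depths_m: np.ndarray,
                          fx: float, fy: float,
                          cx: float, cy: float) -> np.ndarray:
    """Lift an ordered sequence of 2D pixel keypoints (N, 2) with
    metric depths (N,) in meters to absolute 3D points (N, 3) in the
    camera frame, via the pinhole camera model."""
    u, v = pixels_uv[:, 0], pixels_uv[:, 1]
    x = (u - cx) * depths_m / fx
    y = (v - cy) * depths_m / fy
    return np.stack([x, y, depths_m], axis=-1)

# Example: an ordered 3-keypoint manipulation trace.
uv = np.array([[320.0, 240.0], [350.0, 260.0], [380.0, 300.0]])
d = np.array([0.55, 0.52, 0.48])  # predicted metric depth in meters
trace_3d = backproject_keypoints(uv, d, fx=600.0, fy=600.0,
                                 cx=320.0, cy=240.0)
print(trace_3d)  # (3, 3) absolute camera-frame coordinates
```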
📝 Abstract
We introduce RoboBrain 2.5, a next-generation embodied AI foundation model that advances general perception, spatial reasoning, and temporal modeling through large-scale training on high-quality spatiotemporal supervision. Building on its predecessor, RoboBrain 2.5 introduces two major capability upgrades. First, it unlocks Precise 3D Spatial Reasoning by shifting from 2D pixel-relative grounding to depth-aware coordinate prediction and absolute metric constraint comprehension, generating complete 3D manipulation traces as ordered keypoint sequences under physical constraints. Second, complementing this spatial precision, the model establishes Dense Temporal Value Estimation, which provides step-aware progress prediction and execution-state understanding across varying viewpoints, producing stable feedback signals for downstream learning. Together, these upgrades extend the framework toward more physically grounded and execution-aware embodied intelligence for complex, fine-grained manipulation. Code and checkpoints are available on the project website: https://superrobobrain.github.io
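As a rough illustration of what dense, step-aware, viewpoint-invariant supervision can look like, the sketch below assigns each trajectory step a normalized progress value and shares that value across all camera views of the same step. This is a common construction for dense progress targets and is an assumption here, not RoboBrain 2.5's actual formulation; the function name `dense_progress_targets` is hypothetical.

```python
import numpy as np

def dense_progress_targets(num_steps: int, num_views: int) -> np.ndarray:
    """Build dense, step-aware progress targets for a trajectory of
    `num_steps` frames observed from `num_views` cameras. Each step t
    gets normalized progress t / (num_steps - 1); every view of the
    same step shares that target, so the supervision signal is
    viewpoint-invariant by construction."""
    progress = np.linspace(0.0, 1.0, num_steps)  # (T,)
    return np.tile(progress, (num_views, 1))     # (V, T)

targets = dense_progress_targets(num_steps=8, num_views=3)
print(targets.shape)  # (3, 8): one dense progress curve per camera view
```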