AI Summary
Existing depth sensors and 3D reconstruction methods struggle to meet the stringent requirements of robotic manipulation for high-fidelity, metrically consistent geometry. This work proposes Robo3R, a feed-forward, manipulation-ready 3D reconstruction model that jointly infers scale-invariant local geometry and relative camera poses, then aligns predictions into the robot's coordinate frame via a learned global similarity transformation. The method introduces a novel masked point cloud head and a keypoint-driven Perspective-n-Point (PnP) refinement module, significantly enhancing reconstruction accuracy. Trained on Robo3R-4M, a large-scale synthetic dataset, the model consistently outperforms existing 3D reconstruction approaches and depth sensors across diverse downstream tasks, including imitation learning, sim-to-real transfer, grasp generation, and collision-free motion planning.
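The summary's central alignment step is a global similarity transform that maps scale-invariant local geometry into the robot's coordinate frame. Below is a minimal conceptual sketch of that operation, not the authors' implementation; the function and variable names (`apply_global_similarity`, `scale`, `rotation`, `translation`) are illustrative assumptions.

```python
# Conceptual sketch: mapping per-view, scale-invariant point maps into the
# robot base frame with a single learned similarity transform (s, R, t).
# This is an assumption-laden illustration, not the paper's code.
import numpy as np

def apply_global_similarity(local_points: np.ndarray,
                            scale: float,
                            rotation: np.ndarray,
                            translation: np.ndarray) -> np.ndarray:
    """Apply X_robot = s * R @ X_local + t row-wise to an (N, 3) point array."""
    assert rotation.shape == (3, 3) and translation.shape == (3,)
    return scale * (local_points @ rotation.T) + translation

# Example: fuse two hypothetical per-view predictions into one metric scene.
view_a = np.random.rand(1024, 3)            # placeholder point map, camera A
view_b = np.random.rand(1024, 3)            # placeholder point map, camera B
s, R, t = 1.0, np.eye(3), np.zeros(3)       # placeholder learned parameters
scene = np.concatenate([
    apply_global_similarity(view_a, s, R, t),
    apply_global_similarity(view_b, s, R, t),
])
```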
Abstract
3D spatial perception is fundamental to generalizable robotic manipulation, yet obtaining reliable, high-quality 3D geometry remains challenging. Depth sensors suffer from noise and material sensitivity, while existing reconstruction models lack the precision and metric consistency required for physical interaction. We introduce Robo3R, a feed-forward, manipulation-ready 3D reconstruction model that predicts accurate, metric-scale scene geometry directly from RGB images and robot states in real time. Robo3R jointly infers scale-invariant local geometry and relative camera poses, which are unified into the scene representation in the canonical robot frame via a learned global similarity transformation. To meet the precision demands of manipulation, Robo3R employs a masked point head for sharp, fine-grained point clouds, and a keypoint-based Perspective-n-Point (PnP) formulation to refine camera extrinsics and global alignment. Trained on Robo3R-4M, a curated large-scale synthetic dataset with four million high-fidelity annotated frames, Robo3R consistently outperforms state-of-the-art reconstruction methods and depth sensors. Across downstream tasks including imitation learning, sim-to-real transfer, grasp synthesis, and collision-free motion planning, we observe consistent gains in performance, suggesting the promise of this alternative 3D sensing module for robotic manipulation.
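The abstract also mentions a keypoint-based Perspective-n-Point (PnP) formulation for refining camera extrinsics. The following is a hedged sketch of what a generic keypoint-driven PnP refinement looks like, using OpenCV's standard solver; the keypoint sources, function name, and overall structure are assumptions rather than a description of Robo3R's module.

```python
# Hedged sketch: refine camera extrinsics from 2D-3D keypoint correspondences
# with a standard PnP solve. Illustrative only; not the Robo3R implementation.
import cv2
import numpy as np

def refine_extrinsics(points_3d: np.ndarray,   # (N, 3) predicted 3D keypoints
                      points_2d: np.ndarray,   # (N, 2) matched image keypoints
                      K: np.ndarray):          # (3, 3) camera intrinsic matrix
    """Return a refined rotation matrix and translation vector (camera extrinsics)."""
    ok, rvec, tvec = cv2.solvePnP(
        points_3d.astype(np.float64),
        points_2d.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,
        flags=cv2.SOLVEPNP_ITERATIVE,
    )
    if not ok:
        raise RuntimeError("PnP failed; check the keypoint correspondences")
    R, _ = cv2.Rodrigues(rvec)                 # rotation vector -> 3x3 matrix
    return R, tvec.reshape(3)
```

A solve of this kind needs at least four non-degenerate correspondences; in practice the refined pose would then feed back into the global alignment step sketched above.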