AI Summary
This work addresses the challenge of generating physically plausible 6-DoF grasps from RGB inputs, a task hindered by the lack of precise geometric representations in existing methods. The authors propose a novel depth-sensor-free framework that leverages only two-view RGB images and known camera parameters to reconstruct a dense, metrically scaled, and multi-view consistent point cloud using a 3D foundation model. Stable grasp poses are then generated directly from this reconstructed geometry. To the best of the authors' knowledge, this is the first approach to achieve high-quality, metrically accurate, and geometrically consistent grasping under sparse RGB observations, overcoming the geometric-fidelity limitations of conventional RGB-based methods. Experiments demonstrate state-of-the-art performance among RGB-only 6-DoF grasping approaches, both on the GraspNet-1Billion benchmark and in real-world scenarios.
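The summary hinges on the fact that known camera intrinsics and extrinsics fix the metric scale of a two-view reconstruction. Below is a minimal sketch (not the authors' code; all camera values are hypothetical) of that underlying principle, using classic linear (DLT) triangulation of a single correspondence between two calibrated RGB views. MG-Grasp itself replaces such sparse triangulation with a 3D foundation model that outputs dense, multi-view consistent point clouds, but the source of metric scale is the same.

```python
# Toy illustration: two calibrated RGB views recover metric 3D points
# without a depth sensor. Camera parameters here are made-up examples.
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Triangulate one correspondence (pixel x1 in view 1, x2 in view 2)
    given 3x4 projection matrices P1, P2 = K [R | t], via linear DLT."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)      # null vector of A is the homogeneous point
    X = Vt[-1]
    return X[:3] / X[3]              # 3D point in the units of the extrinsics

# Hypothetical shared intrinsics and a 10 cm horizontal baseline.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0,   0.0,   1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])               # reference view
P2 = K @ np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])  # second view

X_true = np.array([0.05, -0.02, 0.6])   # a point 0.6 m in front of camera 1
x1 = P1 @ np.append(X_true, 1.0); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1.0); x2 = x2[:2] / x2[2]

print(triangulate(P1, P2, x1, x2))      # recovers X_true in metric units
```

Because the extrinsic translation is expressed in meters, the recovered point is too; this is why sparse calibrated RGB views suffice to anchor scale where monocular RGB methods cannot.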
Abstract
Single-view RGB-D grasp detection remains a common choice in 6-DoF robotic grasping systems, but it typically requires a depth sensor. While RGB-only 6-DoF grasp methods have been studied recently, their inaccurate geometric representations are not directly suitable for physically reliable robotic manipulation, which hinders stable grasp generation. To address these limitations, we propose MG-Grasp, a novel depth-free 6-DoF grasping framework that achieves high-quality object grasping. Leveraging a two-view 3D foundation model together with known camera intrinsics and extrinsics, our method reconstructs metric-scale, multi-view consistent dense point clouds from sparse RGB images and generates stable 6-DoF grasps. Experiments on the GraspNet-1Billion dataset and in real-world settings demonstrate that MG-Grasp achieves state-of-the-art (SOTA) grasp performance among RGB-based 6-DoF grasping methods.