🤖 AI Summary
Monocular RGB images inherently lack metric depth, which hinders accurate 6D grasp pose estimation for robotic manipulation. To address this, we propose a geometry-aware alignment framework that requires no depth sensors, additional data collection, or model retraining. Building on a monocular depth estimation model (MDEM), our method jointly calibrates scale, rotation, and translation in a single one-shot step, performing end-to-end geometric alignment under camera projection constraints using only sparse ground-truth depth points; it additionally supports fine-tuning for transparent objects. Evaluated on tabletop two-finger grasping and suction-based bin-picking tasks, the system achieves high grasp success rates, demonstrating strong generalization and effective real-world deployment. Our key contribution is the first formulation of full-parameter geometric alignment (scale, rotation, and translation) under one-shot, sparse supervision, balancing accuracy, robustness, and deployment practicality.
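The summary does not spell out the alignment objective, but one plausible reading of joint scale-rotation-translation calibration from sparse depth points is a closed-form similarity-transform fit between back-projected predicted points and metric ground-truth points. The sketch below is a minimal illustration under that assumption, using the standard Umeyama least-squares solution; all names here (`backproject`, `fit_scale_rotation_translation`, `uv_sparse`, `K`) are hypothetical, not from the paper, and the paper's projection-constrained formulation may differ.

```python
import numpy as np

def backproject(uv, depth, K):
    """Back-project N pixels (N, 2) with per-pixel depths (N,) into
    3D camera-frame points (N, 3) using the intrinsic matrix K."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (uv[:, 0] - cx) * depth / fx
    y = (uv[:, 1] - cy) * depth / fy
    return np.stack([x, y, depth], axis=1)

def fit_scale_rotation_translation(P, Q):
    """Closed-form (Umeyama) fit of scale s, rotation R, translation t
    minimizing sum_i || s * R @ P[i] + t - Q[i] ||^2."""
    mu_p, mu_q = P.mean(axis=0), Q.mean(axis=0)
    Pc, Qc = P - mu_p, Q - mu_q
    cov = Qc.T @ Pc / len(P)                       # 3x3 cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    # Reflection guard: force det(R) = +1 so R is a proper rotation.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U) * np.linalg.det(Vt))])
    R = U @ D @ Vt
    s = len(P) * np.trace(np.diag(S) @ D) / (Pc ** 2).sum()
    t = mu_q - s * R @ mu_p
    return s, R, t

# Usage sketch: uv_sparse are pixels with known metric depth (e.g. from a
# calibration target), d_rel is the MDEM depth at those pixels, d_gt the
# sparse metric ground truth, K the camera intrinsics.
#   P = backproject(uv_sparse, d_rel, K)
#   Q = backproject(uv_sparse, d_gt, K)
#   s, R, t = fit_scale_rotation_translation(P, Q)
```

This fixed-similarity fit should be read as the geometric core of such an alignment, not as the paper's full method, which additionally operates under camera projection constraints.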
📝 Abstract
Accurate 6D object pose estimation is a prerequisite for successfully completing robotic prehensile and non-prehensile manipulation tasks. At present, 6D pose estimation for robotic manipulation generally relies on depth sensors based on, e.g., structured light, time-of-flight, or stereo vision, which can be expensive, produce noisy output (compared with RGB cameras), and fail on transparent objects. On the other hand, state-of-the-art monocular depth estimation models (MDEMs) provide only affine-invariant depths, up to an unknown scale and shift. Metric MDEMs achieve some successful zero-shot results on public datasets but fail to generalize to new setups. We propose a novel framework, Monocular One-shot Metric-depth Alignment (MOMA), that recovers metric depth from a single RGB image through a one-shot adaptation building on MDEM techniques. MOMA performs scale-rotation-shift alignment during camera calibration, guided by sparse ground-truth depth points, enabling accurate metric depth estimation without additional data collection or model retraining on the testing setup. MOMA also supports fine-tuning the MDEM on transparent objects, demonstrating strong generalization. Real-world experiments on tabletop two-finger grasping and suction-based bin-picking applications show that MOMA achieves high success rates in diverse tasks, confirming its effectiveness.
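To make "affine-invariant up to an unknown scale and shift" concrete: the classic baseline for this ambiguity fits only a global scale and shift to sparse metric samples by least squares. A minimal sketch, with illustrative variable names and made-up numbers:

```python
import numpy as np

def scale_shift_alignment(d_pred, d_gt):
    """Least-squares fit of (s, b) minimizing sum_i (s * d_pred[i] + b - d_gt[i])^2,
    the standard alignment for affine-invariant depth predictions."""
    A = np.stack([d_pred, np.ones_like(d_pred)], axis=1)   # (N, 2) design matrix
    (s, b), *_ = np.linalg.lstsq(A, d_gt, rcond=None)
    return s, b

# Illustrative numbers only: MDEM output at four probe pixels vs. metric depth (m).
d_pred = np.array([0.31, 0.52, 0.48, 0.75])   # affine-invariant predictions
d_gt   = np.array([0.62, 1.01, 0.95, 1.48])   # sparse metric measurements
s, b = scale_shift_alignment(d_pred, d_gt)
# Metric recovery for the full image: d_metric = s * predicted_depth_map + b
```

Per the abstract, MOMA's scale-rotation-shift alignment adds a rotation term on top of this scale-and-shift correction.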