RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes

📅 2026-02-10
🤖 AI Summary
Monocular metric depth estimation remains challenging in complex scenes, particularly for underrepresented (low-frequency) classes. This work proposes RAD, a retrieval-augmented framework that uses an uncertainty-aware retrieval mechanism to fetch RGB-D context samples with semantically similar content from an external RGB-D database and treats them as geometric proxies, approximating the benefits of multi-view stereo. A dual-stream network with a matched cross-attention module fuses this external geometric knowledge only at reliable point correspondences, improving depth accuracy in low-confidence regions. On underrepresented classes, RAD reduces relative absolute error by 29.2% on NYU Depth v2, 13.3% on KITTI, and 7.2% on Cityscapes, while maintaining competitive performance on standard in-domain benchmarks.
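To make the reported percentages concrete, here is a minimal sketch of the error metric and of a relative reduction. It assumes that "relative absolute error" refers to the commonly used absolute relative error (AbsRel), the mean of |prediction - ground truth| / ground truth over valid pixels. The page does not state the paper's exact definition or how per-class masks are built, so `abs_rel`, `relative_reduction`, and the optional `mask` argument are illustrative names, not the authors' code.

```python
import numpy as np

def abs_rel(pred, gt, mask=None):
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth.

    `mask` is an optional boolean array, e.g. the pixels of one object class
    (an assumption about how a per-class score could be computed).
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0  # zero depth usually marks missing ground truth
    if mask is not None:
        valid &= np.asarray(mask, dtype=bool)
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

def relative_reduction(baseline_err, new_err):
    """Fraction by which `new_err` improves on `baseline_err`."""
    return (baseline_err - new_err) / baseline_err
```

Under this reading, a 29.2% reduction means `relative_reduction(baseline, rad)` is about 0.292 when both errors are measured on the same underrepresented-class pixels.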

📝 Abstract
Monocular Metric Depth Estimation (MMDE) is essential for physically intelligent systems, yet accurate depth estimation for underrepresented classes in complex scenes remains a persistent challenge. To address this, we propose RAD, a retrieval-augmented framework that approximates the benefits of multi-view stereo by utilizing retrieved neighbors as structural geometric proxies. Our method first employs an uncertainty-aware retrieval mechanism to identify low-confidence regions in the input and retrieve RGB-D context samples containing semantically similar content. We then process both the input and retrieved context via a dual-stream network and fuse them using a matched cross-attention module, which transfers geometric information only at reliable point correspondences. Evaluations on NYU Depth v2, KITTI, and Cityscapes demonstrate that RAD significantly outperforms state-of-the-art baselines on underrepresented classes, reducing relative absolute error by 29.2% on NYU Depth v2, 13.3% on KITTI, and 7.2% on Cityscapes, while maintaining competitive performance on standard in-domain benchmarks.
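The uncertainty-aware retrieval step described in the abstract can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: it assumes a per-pixel feature map and uncertainty map are already available, pools features over pixels whose uncertainty exceeds a hypothetical threshold `tau`, and ranks database entries by cosine similarity. The names `retrieve_context`, `db_keys`, `tau`, and `k` are all invented for this example.

```python
import numpy as np

def retrieve_context(feat, uncertainty, db_keys, tau=0.5, k=2):
    """Pick the k database entries most similar to the low-confidence region.

    feat:        (H, W, C) per-pixel features of the input image
    uncertainty: (H, W) per-pixel uncertainty of an initial depth estimate
    db_keys:     (N, C) one descriptor per RGB-D sample in the database
    Returns the low-confidence mask and the indices of the retrieved samples.
    """
    mask = uncertainty > tau  # low-confidence pixels
    if not mask.any():
        # Nothing is uncertain, so no context is retrieved.
        return mask, np.empty(0, dtype=int)

    # Pool features over the uncertain region to form a single query vector.
    query = feat[mask].mean(axis=0)
    query = query / (np.linalg.norm(query) + 1e-8)

    # Cosine similarity between the query and every database descriptor.
    keys = db_keys / (np.linalg.norm(db_keys, axis=1, keepdims=True) + 1e-8)
    sims = keys @ query

    top = np.argsort(-sims)[:k]  # most similar entries first
    return mask, top
```

Building the query only from uncertain pixels is what ties retrieval to the regions where the model is least confident, so retrieved samples target the content that needs help rather than the scene as a whole.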
Problem

Research questions and friction points this paper is trying to address.

Monocular Metric Depth Estimation
Underrepresented Classes
Depth Estimation
Low-frequency Objects
3D Scene Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented
Monocular Metric Depth Estimation
Underrepresented Classes
Cross-Attention Fusion
Uncertainty-Aware Retrieval
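As an illustration of fusing geometry "only at reliable point correspondences", the sketch below shows a single-head cross-attention restricted by a boolean match mask. It is a hypothetical simplification: the page does not describe the paper's projections, number of heads, or how correspondences are scored, so `matched_cross_attention` and `match_mask` are assumptions made for this example.

```python
import numpy as np

def matched_cross_attention(q_tokens, ctx_tokens, ctx_geom, match_mask):
    """Transfer context geometry to input tokens through masked attention.

    q_tokens:   (Nq, C) features of the input image tokens
    ctx_tokens: (Nc, C) features of the retrieved context tokens
    ctx_geom:   (Nc, D) geometric features (e.g. depth encodings) of the context
    match_mask: (Nq, Nc) True where a correspondence is considered reliable
    Returns (Nq, D) transferred geometry and a (Nq,) flag of matched tokens.
    """
    match_mask = np.asarray(match_mask, dtype=bool)
    scale = 1.0 / np.sqrt(q_tokens.shape[1])
    logits = (q_tokens @ ctx_tokens.T) * scale

    # Unreliable pairs get -inf so they receive exactly zero attention weight.
    masked = np.where(match_mask, logits, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)  # rows with no match
    weights = np.exp(masked - row_max)

    # Normalize per query; tokens without any reliable match keep zero weights.
    denom = weights.sum(axis=1, keepdims=True)
    weights = np.divide(weights, denom, out=np.zeros_like(weights), where=denom > 0)

    has_match = match_mask.any(axis=1)
    return weights @ ctx_geom, has_match
```

Input tokens with no reliable match receive a zero update, so in a residual design they would fall back to the network's own monocular prediction instead of absorbing unrelated geometry.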