🤖 AI Summary
This work addresses two obstacles to reliable robotic manipulation and navigation: depth sensors are error-prone on non-Lambertian surfaces such as transparent or specular objects, and monocular depth estimates, although structurally strong, can be skewed or mis-scaled in metric units. The authors propose a training-free depth grounding framework that uses factor graph optimization to align monocular depth priors from a depth foundation model with raw sensor depth through patch-wise affine transformations, preserving fine geometric detail and depth discontinuities while recovering metric scale. Key contributions include the training-free alignment method itself, a benchmark dataset with dense, scene-wide ground-truth depth in the presence of non-Lambertian objects (in contrast to the object-only, CAD-based annotations of prior datasets), and a ground-truth collection procedure that combines matte reflection spray with multi-camera fusion. Evaluated across diverse sensors and domains, the approach consistently improves depth accuracy without any (re-)training, and the implementation is publicly available.
📝 Abstract
Dense and accurate depth estimation is essential for robotic manipulation, grasping, and navigation, yet currently available depth sensors are prone to errors on transparent, specular, and general non-Lambertian surfaces. To mitigate these errors, large-scale monocular depth estimation approaches provide strong structural priors, but their predictions can be skewed or mis-scaled in metric units, limiting their direct use in robotics. We therefore propose a training-free depth grounding framework that anchors monocular depth estimation priors from a depth foundation model in raw sensor depth through factor graph optimization. Our method performs a patch-wise affine alignment, locally grounding monocular predictions in metric real-world depth while preserving fine-grained geometric structure and discontinuities. To facilitate evaluation in challenging real-world conditions, we introduce a benchmark dataset with dense scene-wide ground truth depth in the presence of non-Lambertian objects. Ground truth is obtained via matte reflection spray and multi-camera fusion, overcoming the reliance on object-only CAD-based annotations used in prior datasets. Extensive evaluations across diverse sensors and domains demonstrate consistent improvements in depth accuracy without any (re-)training. We make our implementation publicly available at https://anchord.cs.uni-freiburg.de.
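To make the idea of patch-wise affine alignment concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the factors used in the factor graph, so this example swaps in a simpler technique: an independent closed-form least-squares fit of a scale and shift per patch, with no coupling between neighbouring patches. The function names `fit_affine` and `align_patchwise`, the patch size, and the validity rule (finite, positive sensor depth) are all assumptions made for illustration.

```python
import numpy as np


def fit_affine(mono, sensor, valid):
    """Least-squares scale a and shift b such that a * mono + b ~ sensor.

    Only pixels where `valid` is True contribute. Returns None when fewer
    than two valid pixels are available (hypothetical rule for this sketch).
    """
    x = mono[valid]
    y = sensor[valid]
    if x.size < 2:
        return None
    design = np.stack([x, np.ones_like(x)], axis=1)  # columns: [mono, 1]
    (a, b), *_ = np.linalg.lstsq(design, y, rcond=None)
    return a, b


def align_patchwise(mono, sensor, patch=32):
    """Align a relative depth map to metric sensor depth, patch by patch.

    Assumption: sensor pixels that are non-finite or <= 0 are treated as
    missing. Patches without enough valid pixels fall back to a global fit.
    """
    mono = np.asarray(mono, dtype=np.float64)
    sensor = np.asarray(sensor, dtype=np.float64)
    valid = np.isfinite(sensor) & (sensor > 0)

    global_fit = fit_affine(mono, sensor, valid)
    if global_fit is None:
        raise ValueError("not enough valid sensor depth to align against")

    out = np.empty_like(mono)
    height, width = mono.shape
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            sl = (slice(r, r + patch), slice(c, c + patch))
            fit = fit_affine(mono[sl], sensor[sl], valid[sl])
            a, b = fit if fit is not None else global_fit
            # The affine map is applied to every pixel of the patch,
            # including those where the sensor reported no depth.
            out[sl] = a * mono[sl] + b
    return out
```

Because each patch is solved independently here, the result can show seams at patch borders. A joint optimization over all patch parameters, such as the factor graph formulation named in the abstract, is one way to avoid that, but its exact form is not given in this text.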