UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing monocular metric depth estimation methods exhibit poor generalization under cross-domain settings and fail to achieve zero-shot transfer to unseen data distributions. This paper proposes an end-to-end, calibration-free, reference-free, and fine-tuning-free framework that directly predicts metric depth maps with per-pixel confidence estimates and corresponding 3D point clouds from a single input image. Key contributions include: (1) a self-promptable camera module coupled with a pseudo-spherical depth representation, explicitly decoupling camera parameters from scene geometry; (2) a geometric invariance loss and an edge-guided loss, improving zero-shot robustness and boundary accuracy; and (3) an uncertainty-aware output mechanism. Evaluated zero-shot across ten heterogeneous datasets, the method consistently outperforms state-of-the-art approaches, achieving significant improvements in depth accuracy, structural fidelity, and domain adaptability.
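The decoupling in contribution (1) can be made concrete: the angular fields describe only the camera (ray directions), while the radial field describes only scene geometry. A minimal sketch of such a pseudo-spherical-to-Cartesian conversion is below; the function name and the exact angle parameterization are illustrative assumptions, not the paper's code.

```python
import numpy as np

def pseudo_spherical_to_points(azimuth, elevation, log_depth):
    """Illustrative sketch: convert per-pixel pseudo-spherical outputs
    (azimuth, elevation, log-depth) into metric 3D points.
    The angular fields encode the camera; the radial field encodes
    scene geometry, so the two representations stay disentangled."""
    r = np.exp(log_depth)  # radial distance in metric units
    x = r * np.cos(elevation) * np.sin(azimuth)
    y = r * np.sin(elevation)
    z = r * np.cos(elevation) * np.cos(azimuth)
    return np.stack([x, y, z], axis=-1)
```

A pixel on the optical axis (zero azimuth and elevation) maps to a point straight ahead at its predicted metric distance.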

📝 Abstract
Accurate monocular metric depth estimation (MMDE) is crucial to solving downstream tasks in 3D perception and modeling. However, the remarkable accuracy of recent MMDE methods is confined to their training domains. These methods fail to generalize to unseen domains even in the presence of moderate domain gaps, which hinders their practical applicability. We propose a new model, UniDepthV2, capable of reconstructing metric 3D scenes from solely single images across domains. Departing from the existing MMDE paradigm, UniDepthV2 directly predicts metric 3D points from the input image at inference time without any additional information, striving for a universal and flexible MMDE solution. In particular, UniDepthV2 implements a self-promptable camera module predicting a dense camera representation to condition depth features. Our model exploits a pseudo-spherical output representation, which disentangles the camera and depth representations. In addition, we propose a geometric invariance loss that promotes the invariance of camera-prompted depth features. UniDepthV2 improves its predecessor UniDepth model via a new edge-guided loss which enhances the localization and sharpness of edges in the metric depth outputs, a revisited, simplified and more efficient architectural design, and an additional uncertainty-level output which enables downstream tasks requiring confidence. Thorough evaluations on ten depth datasets in a zero-shot regime consistently demonstrate the superior performance and generalization of UniDepthV2. Code and models are available at https://github.com/lpiccinelli-eth/UniDepth
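The geometric invariance loss promotes consistency of camera-prompted depth features across geometric augmentations of the same scene. A toy sketch of the idea, using a horizontal flip as the example augmentation, is below; the function name and the choice of flip are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def geometric_invariance_loss(feats, feats_flipped):
    """Toy sketch of the invariance idea: depth features computed from a
    horizontally flipped view, once flipped back, should match the
    features of the original view. L1 distance penalizes any mismatch."""
    aligned = feats_flipped[..., ::-1]  # undo the horizontal flip along width
    return np.abs(feats - aligned).mean()
```

In practice the consistency would be enforced between feature maps inside the network rather than on raw arrays, and over a family of augmentations rather than a single flip.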
Problem

Research questions and friction points this paper is trying to address.

Generalizing monocular depth estimation across domains
Predicting 3D metric points from single images
Enhancing edge sharpness in depth outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-promptable camera module predicts a dense camera representation to condition depth features.
Pseudo-spherical output disentangles camera and depth features.
Edge-guided loss enhances depth output sharpness.
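The edge-guided loss in the last bullet can be sketched as a gradient-matching term that is up-weighted wherever the input image has strong intensity edges, encouraging depth boundaries to be sharp and well localized. The function below is a hypothetical simplification; the name, the finite-difference gradients, and the exponential-free weighting are assumptions, not the paper's code.

```python
import numpy as np

def edge_guided_loss(pred_depth, gt_depth, gray_image, alpha=10.0):
    """Hypothetical sketch: penalize depth-gradient errors, weighting
    them more heavily where the (grayscale) image gradient is large,
    i.e. at likely object boundaries."""
    def grads(t):
        # horizontal and vertical finite differences
        return t[:, 1:] - t[:, :-1], t[1:, :] - t[:-1, :]
    pgx, pgy = grads(pred_depth)
    ggx, ggy = grads(gt_depth)
    igx, igy = grads(gray_image)
    wx = 1.0 + alpha * np.abs(igx)  # heavier penalty on image edges
    wy = 1.0 + alpha * np.abs(igy)
    return (wx * np.abs(pgx - ggx)).mean() + (wy * np.abs(pgy - ggy)).mean()
```

A perfect prediction yields zero loss regardless of image content, while blurred depth edges near strong image edges are penalized most.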