🤖 AI Summary
Existing monocular depth estimation methods generalize zero-shot only by predicting affine-invariant (relative) depth without metric scale, while surface normal estimation generalizes poorly to unseen scenes because labeled data is scarce. This paper introduces Metric3D v2, a monocular geometric foundation model for zero-shot joint estimation of metric depth and surface normals from a single image, enabling plug-and-play metric 3D reconstruction across varied camera settings and unseen scenes. Key contributions include: (1) a canonical camera space transformation module that resolves metric scale ambiguity across camera models and can be integrated into existing monocular models; (2) a joint depth-normal optimization module that lets the normal estimator learn from metric depth data, improving its zero-shot generalization; and (3) training on a large-scale, heterogeneous dataset of over 16 million images from thousands of camera models with varied annotations. Experiments show top-ranked performance on multiple zero-shot and standard benchmarks for both metric depth and surface normal estimation. The model also mitigates scale drift in monocular SLAM and enables dense, metric-scale 3D mapping and metric 3D recovery from Internet-sourced images.
📝 Abstract
We introduce Metric3D v2, a geometric foundation model designed for zero-shot metric depth and surface normal estimation from single images, which is critical for accurate 3D recovery. Depth and normal estimation, though complementary, present distinct challenges. State-of-the-art monocular depth methods achieve zero-shot generalization by predicting affine-invariant depth, but fail to recover real-world metric scale. Conversely, current normal estimation techniques struggle with zero-shot performance due to insufficient labeled data. We propose targeted solutions for both metric depth and normal estimation. For metric depth, we present a canonical camera space transformation module that resolves metric ambiguity across various camera models and large-scale datasets, and that can be easily integrated into existing monocular models. For surface normal estimation, we introduce a joint depth-normal optimization module that leverages diverse metric depth data, allowing normal estimators to learn beyond traditional normal labels. Our model, trained on over 16 million images from thousands of camera models with varied annotations, excels in zero-shot generalization to new camera settings. As shown in Fig. 1, it ranks first on multiple zero-shot and standard benchmarks for metric depth and surface normal prediction. Our method enables the accurate recovery of metric 3D structures from randomly collected internet images, paving the way for plausible single-image metrology. Our model also alleviates the scale drift of monocular SLAM (Fig. 3), leading to high-quality, metric-scale dense mapping. Such applications highlight the versatility of Metric3D v2 models as geometric foundation models.
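The two geometric ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the released implementation. It assumes a pinhole camera and a label-scaling form of the canonical transformation, in which metric depth is multiplied by the ratio of a canonical focal length to the real focal length and the ratio is inverted at inference. The value of `CANONICAL_FOCAL` and all function names are hypothetical. The last function only shows the geometric link that lets depth data inform normals (back-projection followed by a cross product of finite-difference tangents); the joint optimization module itself is not reproduced here.

```python
import numpy as np

# Hypothetical canonical focal length in pixels (an assumption for illustration).
CANONICAL_FOCAL = 1000.0


def to_canonical_depth(depth_m, focal_px, canonical_focal=CANONICAL_FOCAL):
    """Scale a metric depth map into the assumed canonical camera space."""
    return depth_m * (canonical_focal / focal_px)


def from_canonical_depth(canonical_depth, focal_px, canonical_focal=CANONICAL_FOCAL):
    """Invert the scaling to recover metric depth for the real camera."""
    return canonical_depth * (focal_px / canonical_focal)


def normals_from_depth(depth_m, fx, fy, cx, cy):
    """Derive unit surface normals from a metric depth map (pinhole model).

    Sign convention (an assumption): normals face the camera, so a
    fronto-parallel plane yields (0, 0, -1).
    """
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project every pixel to a 3D point in camera coordinates.
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    points = np.stack([x, y, depth_m], axis=-1)
    # Finite-difference tangent vectors along image columns and rows.
    d_du = np.gradient(points, axis=1)
    d_dv = np.gradient(points, axis=0)
    # The cross product of the tangents is perpendicular to the surface.
    n = np.cross(d_dv, d_du)
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-12
    return n


# Usage: a flat wall 2 m away, seen by a camera with a 500 px focal length.
depth = np.full((4, 5), 2.0)
canonical = to_canonical_depth(depth, focal_px=500.0)
recovered = from_canonical_depth(canonical, focal_px=500.0)
normals = normals_from_depth(depth, fx=500.0, fy=500.0, cx=2.0, cy=1.5)
```

Under these assumptions, a network trained on canonical depth sees one consistent relation between image content and depth regardless of the source camera, and the known focal length restores metric scale afterwards.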