Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-Shot Metric Depth and Surface Normal Estimation

📅 2024-03-22

🏛️ IEEE Transactions on Pattern Analysis and Machine Intelligence

📈 Citations: 40

✨ Influential: 11

🤖 AI Summary

Existing monocular depth estimation methods generalize zero-shot only by predicting affine-invariant (relative) depth, which lacks real-world metric scale, while surface normal estimation generalizes poorly to unseen scenes because labeled data is scarce. This paper introduces a monocular geometric foundation model that jointly estimates metric depth and surface normals from a single image in a zero-shot setting, enabling plug-and-play metric 3D reconstruction across different camera parameters and unseen scenes. Key contributions include: (1) a canonical camera space transformation module that resolves the metric ambiguity caused by differing camera models and can be integrated into existing monocular models; (2) a joint depth-normal optimization module that lets the normal estimator learn from metric depth data, improving its zero-shot generalization; and (3) training on a large-scale, heterogeneous dataset of more than 16 million images from thousands of camera models with varied annotations. Experiments show top-ranked results on multiple zero-shot and standard benchmarks for both metric depth and surface normal estimation. The model also reduces scale drift in monocular SLAM and enables dense, metric-scale 3D mapping from images collected from the Internet.
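The idea behind a canonical camera space can be illustrated with a few lines of arithmetic. The sketch below is not taken from the paper's code: it assumes a pinhole camera, a hypothetical canonical focal length `CANONICAL_FOCAL_PX`, and a label-scaling variant in which depth is multiplied by the ratio of canonical to actual focal length. The function names are illustrative only.

```python
# Hypothetical canonical focal length in pixels (an assumption for this sketch).
CANONICAL_FOCAL_PX = 1000.0


def to_canonical_depth(depth_m, focal_px, canonical_focal_px=CANONICAL_FOCAL_PX):
    """Map a metric depth seen by a camera with focal length `focal_px`
    to the depth the same image content would have in the canonical camera."""
    return depth_m * canonical_focal_px / focal_px


def from_canonical_depth(canonical_depth, focal_px, canonical_focal_px=CANONICAL_FOCAL_PX):
    """Invert the mapping: recover metric depth for the real camera."""
    return canonical_depth * focal_px / canonical_focal_px


# Under a pinhole model, an object at 2 m seen with f = 500 px and the same
# object at 4 m seen with f = 1000 px project to the same pixel size.
# Both map to a single canonical depth, so the target is no longer ambiguous.
near_wide = to_canonical_depth(2.0, focal_px=500.0)
far_tele = to_canonical_depth(4.0, focal_px=1000.0)
print(near_wide, far_tele)

# A network trained on canonical depths can be converted back per camera.
print(from_canonical_depth(near_wide, focal_px=500.0))
```

In this reading, the network only ever sees one (virtual) camera during training, and the known focal length of the real camera restores metric scale at inference time.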

📝 Abstract

We introduce Metric3D v2, a geometric foundation model designed for zero-shot metric depth and surface normal estimation from single images, critical for accurate 3D recovery. Depth and normal estimation, though complementary, present distinct challenges. State-of-the-art monocular depth methods achieve zero-shot generalization through affine-invariant depths, but fail to recover real-world metric scale. Conversely, current normal estimation techniques struggle with zero-shot performance due to insufficient labeled data. We propose targeted solutions for both metric depth and normal estimation. For metric depth, we present a canonical camera space transformation module that resolves metric ambiguity across various camera models and large-scale datasets, which can be easily integrated into existing monocular models. For surface normal estimation, we introduce a joint depth-normal optimization module that leverages diverse data from metric depth, allowing normal estimators to improve beyond traditional labels. Our model, trained on over 16 million images from thousands of camera models with varied annotations, excels in zero-shot generalization to new camera settings. As shown in Fig. 1, it ranks first on multiple zero-shot and standard benchmarks for metric depth and surface normal prediction. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. Our model also relieves the scale drift issues of monocular SLAM (Fig. 3), leading to high-quality metric-scale dense mapping. Such applications highlight the versatility of Metric3D v2 models as geometric foundation models.
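Depth and normals are complementary because a metric depth map, together with camera intrinsics, already determines a surface orientation at each pixel. The sketch below shows that purely geometric relationship with NumPy: it back-projects a depth map to 3D points and takes the cross product of finite differences. It is not the paper's joint optimization module, only an illustration of why depth data can inform normals. The intrinsics `fx`, `fy`, `cx`, `cy` and both function names are assumptions made for this example.

```python
import numpy as np


def backproject(depth, fx, fy, cx, cy):
    """Lift an (H, W) metric depth map to (H, W, 3) camera-space points
    under a pinhole model (x right, y down, z forward)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)


def normals_from_depth(depth, fx, fy, cx, cy, eps=1e-12):
    """Estimate unit normals for interior pixels via central differences.
    Returns an (H-2, W-2, 3) array; normals point toward the camera (-z)."""
    pts = backproject(depth, fx, fy, cx, cy)
    du = pts[1:-1, 2:] - pts[1:-1, :-2]   # tangent along image columns
    dv = pts[2:, 1:-1] - pts[:-2, 1:-1]   # tangent along image rows
    n = np.cross(dv, du)                  # order chosen so n faces the camera
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.maximum(norm, eps)


# A fronto-parallel wall at 2 m should have normals facing the camera.
wall = np.full((5, 5), 2.0)
n = normals_from_depth(wall, fx=500.0, fy=500.0, cx=2.0, cy=2.0)
print(n[0, 0])
```

Finite differences like these are sensitive to depth noise and to discontinuities at object boundaries, which is one reason learned normal estimators remain useful even when depth is available.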

Problem

Research questions and friction points this paper is trying to address.

Monocular Depth Estimation

Real-world Measurement

Surface Orientation Estimation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Single Image 3D Modeling

Depth Estimation

Surface Orientation Optimization
