🤖 AI Summary
This work addresses the problem of open-domain single-image 3D geometric reconstruction. To resolve global scale and translation ambiguities, we propose an affine-invariant 3D point cloud representation. Methodologically, we design an optimal point cloud alignment solver and a multi-scale local geometric consistency loss to mitigate the inherent ambiguity of monocular geometric supervision. Our approach integrates affine-invariant representation learning, robust point cloud registration, and end-to-end training on a hybrid large-scale dataset. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple unseen benchmarks. It significantly improves accuracy and generalization in monocular 3D point cloud reconstruction, depth estimation, and field-of-view prediction. By eliminating the need for camera calibration or explicit metric priors, our framework establishes a new paradigm for uncalibrated single-image geometric understanding.
📝 Abstract
We present MoGe, a powerful model for recovering 3D geometry from monocular open-domain images. Given a single image, our model directly predicts a 3D point map of the captured scene with an affine-invariant representation, which is agnostic to true global scale and shift. This new representation precludes ambiguous supervision during training and facilitates effective geometry learning. Furthermore, we propose a set of novel global and local geometry supervisions that empower the model to learn high-quality geometry: a robust, optimal, and efficient point cloud alignment solver for accurate global shape learning, and a multi-scale local geometry loss for precise local geometry supervision. We train our model on a large, mixed dataset and demonstrate its strong generalizability and high accuracy. In a comprehensive evaluation on diverse unseen datasets, our model significantly outperforms state-of-the-art methods across all tasks, including monocular estimation of the 3D point map, depth map, and camera field of view. Code and models can be found on our project page.
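To make the affine-invariant idea concrete: a predicted point map is only defined up to an unknown global scale and shift, so comparing it against ground truth requires first solving for the best-fitting scale and translation. Below is a minimal closed-form least-squares sketch of that alignment step. It assumes a full 3D translation ambiguity and a plain L2 objective; the paper's actual solver is described as robust and optimal, and its exact ambiguity model and objective may differ, so treat this purely as an illustration.

```python
import numpy as np

def align_scale_shift(pred, gt):
    """Align a scale/shift-ambiguous point map `pred` (N, 3) to ground
    truth `gt` (N, 3) by solving
        min_{s, t}  sum_i || s * pred_i + t - gt_i ||^2
    in closed form. NOTE: illustrative least squares only, not the
    paper's robust alignment solver.
    """
    p_mean = pred.mean(axis=0)
    q_mean = gt.mean(axis=0)
    p_c = pred - p_mean          # centered prediction
    q_c = gt - q_mean            # centered ground truth
    # Optimal global scale from the normal equations of the 1-D problem.
    s = (p_c * q_c).sum() / (p_c * p_c).sum()
    # With s fixed, the optimal translation matches the centroids.
    t = q_mean - s * p_mean
    return s, t

# Usage: recover a synthetic scale and shift exactly (noise-free data).
rng = np.random.default_rng(0)
gt = rng.normal(size=(100, 3))
shift = np.array([0.1, -0.2, 0.5])
pred = (gt - shift) / 2.0        # so gt = 2 * pred + shift
s, t = align_scale_shift(pred, gt)
# s ≈ 2.0, t ≈ [0.1, -0.2, 0.5]
```

After this alignment, any geometry loss between `s * pred + t` and `gt` supervises shape rather than the unrecoverable global scale and position, which is the ambiguity the affine-invariant representation is designed to sidestep.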