🤖 AI Summary
This work addresses the problem of open-domain single-image 3D geometric reconstruction. To resolve global scale and translation ambiguities, we propose an affine-invariant 3D point cloud representation. Methodologically, we design an optimal point cloud alignment solver and a multi-scale local geometric consistency loss to mitigate the inherent ambiguity of monocular geometric supervision. Our approach integrates affine-invariant representation learning, robust point cloud registration, and end-to-end training on a hybrid large-scale dataset. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple unseen benchmarks. It significantly improves accuracy and generalization in monocular 3D point cloud reconstruction, depth estimation, and field-of-view prediction. By eliminating the need for camera calibration or explicit metric priors, our framework establishes a new paradigm for uncalibrated single-image geometric understanding.
📝 Abstract
We present MoGe, a powerful model for recovering 3D geometry from monocular open-domain images. Given a single image, our model directly predicts a 3D point map of the captured scene with an affine-invariant representation, which is agnostic to true global scale and shift. This new representation precludes ambiguous supervision during training and facilitates effective geometry learning. Furthermore, we propose a set of novel global and local geometry supervisions that empower the model to learn high-quality geometry: a robust, optimal, and efficient point cloud alignment solver for accurate global shape learning, and a multi-scale local geometry loss for precise local geometry supervision. We train our model on a large, mixed dataset and demonstrate its strong generalizability and high accuracy. In a comprehensive evaluation on diverse unseen datasets, our model significantly outperforms state-of-the-art methods across all tasks, including monocular estimation of the 3D point map, depth map, and camera field of view. Code and models can be found on our project page.
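To make the affine-invariant idea concrete: a predicted point map is only defined up to an unknown global scale and shift, so comparing it against ground truth requires first solving for the best-fitting scale and translation. Below is a minimal closed-form least-squares sketch of that alignment step. It assumes a full 3D translation ambiguity and a plain L2 objective; the paper's actual solver is described as robust and optimal, and its exact ambiguity model and objective may differ, so treat this purely as an illustration.

```python
import numpy as np

def align_scale_shift(pred, gt):
    """Align a scale/shift-ambiguous point map `pred` (N, 3) to ground
    truth `gt` (N, 3) by solving
        min_{s, t}  sum_i || s * pred_i + t - gt_i ||^2
    in closed form. NOTE: illustrative least squares only, not the
    paper's robust alignment solver.
    """
    p_mean = pred.mean(axis=0)
    q_mean = gt.mean(axis=0)
    p_c = pred - p_mean          # centered prediction
    q_c = gt - q_mean            # centered ground truth
    # Optimal global scale from the normal equations of the 1-D problem.
    s = (p_c * q_c).sum() / (p_c * p_c).sum()
    # With s fixed, the optimal translation matches the centroids.
    t = q_mean - s * p_mean
    return s, t

# Usage: recover a synthetic scale and shift exactly (noise-free data).
rng = np.random.default_rng(0)
gt = rng.normal(size=(100, 3))
shift = np.array([0.1, -0.2, 0.5])
pred = (gt - shift) / 2.0        # so gt = 2 * pred + shift
s, t = align_scale_shift(pred, gt)
# s ≈ 2.0, t ≈ [0.1, -0.2, 0.5]
```

After this alignment, any geometry loss between `s * pred + t` and `gt` supervises shape rather than the unrecoverable global scale and position, which is the ambiguity the affine-invariant representation is designed to sidestep.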