🤖 AI Summary
This work addresses the limited accuracy of monocular depth estimation when applied to multi-view geometric reconstruction. To bridge this gap, we explicitly embed multi-view geometric priors derived from Structure-from-Motion (SfM) sparse reconstructions into a monocular depth estimation framework. Methodologically, we propose a geometry-guided depth network, an end-to-end trainable geometric consistency loss, and a multi-scale feature alignment module, enabling tight coupling between monocular depth and multi-view geometry without explicit Multi-View Stereo (MVS) optimization. Our key contribution is the first integration of SfM-derived geometric constraints as strong, direct supervision within the monocular depth learning pipeline. Experiments demonstrate that our method achieves significantly higher depth prediction accuracy than state-of-the-art monocular approaches. Moreover, on diverse real-world scenes, including indoor, street-view, and aerial imagery, our reconstructed 3D geometry consistently surpasses the best current MVS methods in quality.
📝 Abstract
In this paper, we present a new method for multi-view geometric reconstruction. In recent years, large vision models have developed rapidly, performing excellently across various tasks and demonstrating remarkable generalization capabilities. Some works apply large vision models to monocular depth estimation, and their outputs have been used to facilitate multi-view reconstruction tasks in an indirect manner. Due to the inherent ambiguity of monocular depth estimation, the estimated depth values are usually not accurate enough, limiting their utility in aiding multi-view reconstruction. We propose to incorporate SfM information, a strong multi-view prior, into the depth estimation process, thus enhancing the quality of depth prediction and enabling its direct application in multi-view geometric reconstruction. Experimental results on public real-world datasets show that our method significantly improves the quality of depth estimation compared to previous monocular depth estimation works. Additionally, we evaluate the reconstruction quality of our approach on various types of scenes, including indoor, streetscape, and aerial views, surpassing state-of-the-art MVS methods. The code and supplementary materials are available at https://zju3dv.github.io/murre/.
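To make the idea of an SfM-derived prior concrete: one common way to feed sparse SfM geometry into a depth network is to project the reconstructed 3D points into each camera view, producing a sparse depth map that can condition or supervise the prediction. The sketch below is illustrative only and is not the paper's actual pipeline; the function name and interface are assumptions, and a pinhole camera model with world-to-camera extrinsics `(R, t)` is assumed.

```python
import numpy as np

def sfm_points_to_sparse_depth(points_world, R, t, K, hw):
    """Project SfM 3D points into a camera to form a sparse depth map.

    points_world: (N, 3) points from the SfM sparse reconstruction.
    R, t: world-to-camera rotation (3, 3) and translation (3,).
    K: (3, 3) pinhole intrinsics.
    hw: (H, W) output resolution.
    Returns an (H, W) float32 depth map, 0 where no point projects.
    """
    H, W = hw
    # Transform points into the camera frame.
    p_cam = points_world @ R.T + t
    z = p_cam[:, 2]
    valid = z > 1e-6  # keep only points in front of the camera
    p_cam, z = p_cam[valid], z[valid]
    # Perspective projection to pixel coordinates.
    uv = (p_cam @ K.T)[:, :2] / z[:, None]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, z = u[inside], v[inside], z[inside]
    depth = np.zeros((H, W), dtype=np.float32)
    # If several points land on one pixel, keep the nearest:
    # write far points first so near ones overwrite them.
    order = np.argsort(-z)
    depth[v[order], u[order]] = z[order]
    return depth
```

Such a sparse depth map is metrically consistent across views (up to the SfM scale), which is exactly the multi-view constraint that purely monocular predictions lack.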