MonoMVSNet: Monocular Priors Guided Multi-View Stereo Network

📅 2025-07-15
🤖 AI Summary
To address inaccurate depth estimation in textureless and reflective regions caused by feature-matching failures in multi-view stereo (MVS), this paper proposes an MVS network guided by monocular feature and depth priors. Methodologically, it designs a cross-view attention mechanism with a newly designed cross-view positional encoding to enhance geometric consistency; introduces monocular depth alignment and differentiable sampling to tightly fuse monocular priors with multi-view geometric constraints; and proposes a dynamic depth-hypothesis update strategy together with a relative consistency loss to improve the robustness of the depth candidates. The approach achieves state-of-the-art performance on the DTU and Tanks-and-Temples benchmarks, ranking first on both the Intermediate and Advanced subsets of Tanks-and-Temples, and markedly improves dense point-cloud reconstruction in challenging regions.

📝 Abstract
Learning-based Multi-View Stereo (MVS) methods aim to predict depth maps for a sequence of calibrated images to recover dense point clouds. However, existing MVS methods often struggle with challenging regions, such as textureless regions and reflective surfaces, where feature matching fails. In contrast, monocular depth estimation inherently does not require feature matching, allowing it to achieve robust relative depth estimation in these regions. To bridge this gap, we propose MonoMVSNet, a novel monocular feature and depth guided MVS network that integrates powerful priors from a monocular foundation model into multi-view geometry. Firstly, the monocular feature of the reference view is integrated into source view features by the attention mechanism with a newly designed cross-view position encoding. Then, the monocular depth of the reference view is aligned to dynamically update the depth candidates for edge regions during the sampling procedure. Finally, a relative consistency loss is further designed based on the monocular depth to supervise the depth prediction. Extensive experiments demonstrate that MonoMVSNet achieves state-of-the-art performance on the DTU and Tanks-and-Temples datasets, ranking first on the Tanks-and-Temples Intermediate and Advanced benchmarks. The source code is available at https://github.com/JianfeiJ/MonoMVSNet.
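The abstract describes a relative consistency loss that uses the monocular depth, which is only defined up to an unknown scale and shift, to supervise the metric depth prediction. A minimal sketch of one plausible realization: align the monocular depth to the prediction with a least-squares scale and shift, then penalize the L1 residual. The function names and the exact alignment/loss form are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def align_scale_shift(mono_depth, mvs_depth, mask):
    """Solve for scale s and shift t minimizing ||s * mono + t - mvs||^2
    over valid pixels (least squares), since monocular depth is only
    defined up to an affine transform."""
    x = mono_depth[mask]
    y = mvs_depth[mask]
    A = np.stack([x, np.ones_like(x)], axis=1)  # design matrix [mono, 1]
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s, t

def relative_consistency_loss(pred_depth, mono_depth, mask):
    """L1 between the predicted depth and the affine-aligned monocular
    depth over valid pixels (a hypothetical sketch of the loss)."""
    s, t = align_scale_shift(mono_depth, pred_depth, mask)
    aligned = s * mono_depth + t
    return np.abs(pred_depth - aligned)[mask].mean()
```

Because the alignment absorbs scale and shift, the loss only penalizes deviations in relative depth structure, which is exactly what a monocular prior can supervise reliably.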
Problem

Research questions and friction points this paper is trying to address.

Depth prediction degrades in textureless and reflective regions, where multi-view feature matching fails
How to integrate monocular depth priors into a multi-view stereo network
How to make cross-view feature matching robust in challenging regions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Monocular feature integration via attention mechanism
Dynamic depth candidate updating for edge regions
Relative consistency loss based on monocular depth
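The second innovation, dynamic depth candidate updating, can be sketched as follows: at edge pixels, where coarse-to-fine hypothesis ranges tend to miss depth discontinuities, re-center the depth hypotheses on the aligned monocular depth. The function, the edge mask source, and the relative sampling interval are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def update_hypotheses(hyps, mono_depth_aligned, edge_mask, radius=0.1):
    """Re-center depth hypotheses at edge pixels around the aligned
    monocular depth prior (a hypothetical sketch).

    hyps:               (D, H, W) current depth candidates
    mono_depth_aligned: (H, W) monocular depth after scale/shift alignment
    edge_mask:          (H, W) bool, pixels flagged as depth edges
    radius:             half-width of the new relative interval (assumed)
    """
    D = hyps.shape[0]
    # Relative offsets spanning [-radius, +radius] around the prior.
    offsets = np.linspace(-radius, radius, D).reshape(D, 1, 1)
    new_hyps = mono_depth_aligned[None] * (1.0 + offsets)
    out = hyps.copy()
    out[:, edge_mask] = new_hyps[:, edge_mask]  # only edge pixels are updated
    return out
```

Non-edge pixels keep the candidates produced by the standard coarse-to-fine schedule; only pixels the monocular prior can help are re-sampled.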