AI Summary
This paper addresses multi-view stereo (MVS) reconstruction in the absence of ground-truth depth labels. To this end, we propose DFM-MVS, a framework that uses a depth foundation model (DFM) to generate high-confidence depth priors. Building on these priors, we establish a prior-driven pseudo-supervised training paradigm and design a prior-guided error correction module that limits error propagation in coarse-to-fine stereo matching and strengthens geometric consistency. DFM-MVS is trained without any real depth supervision, which mitigates two key bottlenecks in unsupervised MVS: severe noise in pseudo-labels and weak geometric constraints. Experiments on the DTU and Tanks & Temples benchmarks show that DFM-MVS outperforms existing unsupervised and self-supervised methods, with reconstruction accuracy close to state-of-the-art supervised approaches. These results underscore the value and generalizability of depth priors for MVS trained without ground-truth labels.
Abstract
Learning-based Multi-View Stereo (MVS) methods have made remarkable progress in recent years. However, how to train the network effectively without real-world labels remains a challenging problem. In this paper, driven by recent advances in vision foundation models, we propose a novel method termed DFM-MVS, which leverages a depth foundation model to generate an effective depth prior and thereby boost MVS in the absence of real-world labels. Specifically, a depth prior-based pseudo-supervised training mechanism is developed to simulate realistic stereo correspondences using the generated depth prior, thereby constructing effective supervision for the MVS network. In addition, a depth prior-guided error correction strategy uses the depth prior as guidance to mitigate the error propagation problem inherent in the widely used coarse-to-fine network structure. Experimental results on the DTU and Tanks & Temples datasets demonstrate that the proposed DFM-MVS significantly outperforms existing MVS methods that do not use real-world labels.
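The abstract does not spell out how the depth prior guides error correction, so the following is only a minimal NumPy sketch of one plausible reading, not the authors' implementation. It assumes the prior is defined only up to an unknown scale and shift, fits those two values against the coarse-stage depth on pixels the network is confident about, and then replaces low-confidence coarse estimates that deviate strongly from the aligned prior. The function names, the confidence map, and both thresholds are hypothetical.

```python
import numpy as np


def align_prior(prior, coarse, mask):
    """Least-squares fit of scale s and shift t so that s * prior + t ~ coarse on `mask`."""
    # Design matrix [prior, 1] restricted to the pixels selected by the mask.
    A = np.stack([prior[mask], np.ones(int(mask.sum()))], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, coarse[mask], rcond=None)
    return s * prior + t


def correct_coarse_depth(coarse, prior, confidence, conf_thresh=0.5, rel_tol=0.1):
    """Replace unreliable coarse depths with the aligned prior (illustrative only).

    coarse:     coarse-stage depth map, shape (H, W)
    prior:      depth prior, assumed correct up to scale and shift, shape (H, W)
    confidence: per-pixel confidence of the coarse estimate in [0, 1] (assumed input)
    """
    reliable = confidence >= conf_thresh
    aligned = align_prior(prior, coarse, reliable)
    # A pixel is suspicious if it disagrees with the aligned prior by more than rel_tol.
    deviates = np.abs(coarse - aligned) > rel_tol * np.abs(aligned)
    replace = deviates & ~reliable
    corrected = np.where(replace, aligned, coarse)
    return corrected, replace
```

In a coarse-to-fine pipeline, a corrected map of this kind would re-center the depth hypotheses of the next stage, so that a wrong coarse estimate does not push the true depth outside the finer search range.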