🤖 AI Summary
To address poor generalization, variable input view counts, and unknown effective depth ranges in cross-domain (e.g., indoor-to-outdoor) multi-view depth estimation, this paper proposes MVSA, a zero-shot cross-scene depth estimation method. The approach introduces an adaptive cost volume fusion mechanism that jointly models monocular priors and multi-view geometric cues; a Transformer-based architecture that supports variable-length view inputs; and metadata-driven, scale-adaptive cost volume construction and optimization. Crucially, the method requires no target-domain training, accommodates arbitrary numbers of input views, and operates robustly under unknown depth ranges. Evaluated on the Robust Multi-View Depth Benchmark, it achieves state-of-the-art zero-shot performance, significantly outperforming existing monocular and multi-view depth estimation methods, while maintaining architectural flexibility and domain-agnostic inference.
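To make the scale-adaptive idea concrete, here is a minimal numpy sketch of one way a depth-hypothesis range could be derived from a monocular prior rather than fixed a priori. This is an illustrative construction, not the paper's actual architecture; the function name, percentile choice, and margin are assumptions for the example.

```python
import numpy as np

def adaptive_depth_hypotheses(mono_depth, num_planes=8, margin=0.2):
    """Derive scene-adaptive depth hypotheses from a monocular depth prior.

    Illustrative sketch (not MVSA's actual procedure): instead of a fixed
    [d_min, d_max] sweep range, take the range from robust percentiles of
    the prior and widen it by a relative margin.
    """
    lo, hi = np.percentile(mono_depth, [5, 95])
    lo, hi = lo * (1 - margin), hi * (1 + margin)
    # Sample hypotheses uniformly in inverse depth (denser near the camera),
    # a common choice in plane-sweep stereo.
    inv = np.linspace(1.0 / hi, 1.0 / max(lo, 1e-6), num_planes)
    return 1.0 / inv

# Example: an indoor-like prior in metres (synthetic, for illustration).
prior = np.random.default_rng(0).uniform(1.0, 4.0, size=(32, 32))
hyps = adaptive_depth_hypotheses(prior, num_planes=8)
```

Because the sweep range follows the prior, the same construction works for a 2 m indoor room or a 100 m outdoor street without retraining.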
📝 Abstract
Computing accurate depth from multiple views is a fundamental and longstanding challenge in computer vision. However, most existing approaches do not generalize well across different domains and scene types (e.g. indoor vs. outdoor). Training a general-purpose multi-view stereo model is challenging and raises several questions: how best to make use of transformer-based architectures, how to incorporate additional metadata when there is a variable number of input views, and how to estimate the range of valid depths, which can vary considerably across scenes and is typically not known a priori. To address these issues, we introduce MVSA, a novel and versatile Multi-View Stereo architecture that aims to work Anywhere by generalizing across diverse domains and depth ranges. MVSA combines monocular and multi-view cues with an adaptive cost volume to address scale-related issues. We demonstrate state-of-the-art zero-shot depth estimation on the Robust Multi-View Depth Benchmark, surpassing existing multi-view stereo and monocular baselines.
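The variable-number-of-views question in the abstract can be illustrated with a small numpy sketch: if per-view matching costs are aggregated by a symmetric operation such as the mean, the fusion is permutation-invariant and works for any view count. This is a generic construction standing in for the paper's Transformer-based aggregation; the function and variable names are assumptions for the example.

```python
import numpy as np

def fuse_view_costs(ref_feat, src_feats):
    """Fuse matching costs from an arbitrary number of source views.

    ref_feat:  (C, H, W) reference-view features.
    src_feats: list of (C, H, W) source-view features warped into the
               reference frame (any length).
    Averaging per-view correlations keeps the result independent of both
    the number and the order of input views (a simple stand-in for
    attention-based aggregation).
    """
    costs = [np.sum(ref_feat * s, axis=0) for s in src_feats]  # (H, W) each
    return np.mean(costs, axis=0)

rng = np.random.default_rng(1)
ref = rng.standard_normal((16, 8, 8))
srcs = [rng.standard_normal((16, 8, 8)) for _ in range(3)]
fused3 = fuse_view_costs(ref, srcs)                # 3 source views
fused5 = fuse_view_costs(ref, srcs + [rng.standard_normal((16, 8, 8))
                                      for _ in range(2)])  # 5 source views
```

A Transformer with cross-view attention generalizes this idea: attention weights replace the uniform mean, while the set-based formulation keeps the input length unconstrained.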