MVSAnywhere: Zero-Shot Multi-View Stereo

📅 2025-03-28
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address poor generalization, variable input view counts, and unknown effective depth ranges in cross-domain (e.g., indoor-to-outdoor) multi-view depth estimation, this paper proposes a zero-shot cross-scene depth reconstruction method. The approach introduces an adaptive cost volume fusion mechanism that jointly models monocular priors and multi-view geometric cues; integrates a Transformer-based architecture supporting variable-length view inputs; and employs metadata-driven, scale-adaptive cost volume construction and optimization. Crucially, the method requires no target-domain training, accommodates arbitrary numbers of input views, and operates robustly under unknown depth ranges. Evaluated on the Robust Multi-View Depth Benchmark, it achieves state-of-the-art zero-shot performance, significantly outperforming existing monocular and multi-view depth estimation methods, while maintaining architectural flexibility and domain-agnostic inference.
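The view-count-agnostic cost volume described above can be illustrated with a small sketch. This is not the paper's implementation: the function names, feature shapes, and the simple dot-product matching cost are hypothetical. The key idea it shows is that averaging per-view matching costs makes the volume accept any number of source views, and a soft-argmax over the depth hypotheses turns the volume into a depth map.

```python
import numpy as np

def plane_sweep_cost_volume(ref_feat, src_feats_warped):
    """Hypothetical sketch: fuse matching costs from a variable number of
    source views into one cost volume, as in plane-sweep multi-view stereo.

    ref_feat:          (C, H, W) reference-view features
    src_feats_warped:  list of (D, C, H, W) source-view features, each
                       already warped to the reference view at D depth
                       hypotheses (warping itself is omitted here)
    returns:           (D, H, W) cost volume; the mean over views makes
                       the output independent of how many views are given
    """
    costs = []
    for src in src_feats_warped:
        # per-depth feature correlation between reference and warped source
        costs.append((ref_feat[None] * src).sum(axis=1))  # (D, H, W)
    return np.mean(costs, axis=0)

def depth_from_volume(cost_volume, depth_hypotheses):
    """Soft-argmax over depth hypotheses: a differentiable stand-in for
    picking the best-matching depth per pixel."""
    # softmax over the depth axis (numerically stabilized)
    w = np.exp(cost_volume - cost_volume.max(axis=0, keepdims=True))
    w /= w.sum(axis=0, keepdims=True)
    # expected depth under the matching distribution -> (H, W)
    return (w * depth_hypotheses[:, None, None]).sum(axis=0)
```

Because the fusion is a mean rather than a concatenation, adding or removing source views changes only the averaging, not the network's input shape, which is one common way to support variable-length view inputs.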

📝 Abstract
Computing accurate depth from multiple views is a fundamental and longstanding challenge in computer vision. However, most existing approaches do not generalize well across different domains and scene types (e.g. indoor vs. outdoor). Training a general-purpose multi-view stereo model is challenging and raises several questions, e.g. how to best make use of transformer-based architectures, how to incorporate additional metadata when there is a variable number of input views, and how to estimate the range of valid depths which can vary considerably across different scenes and is typically not known a priori? To address these issues, we introduce MVSA, a novel and versatile Multi-View Stereo architecture that aims to work Anywhere by generalizing across diverse domains and depth ranges. MVSA combines monocular and multi-view cues with an adaptive cost volume to deal with scale-related issues. We demonstrate state-of-the-art zero-shot depth estimation on the Robust Multi-View Depth Benchmark, surpassing existing multi-view stereo and monocular baselines.
Problem

Research questions and friction points this paper is trying to address.

Generalizing depth estimation across diverse domains and scenes
Handling variable input views and unknown depth ranges
Combining monocular and multi-view cues for scale adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses transformer-based architecture for generalization
Combines monocular and multi-view cues adaptively
Estimates depth range dynamically across scenes
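Dynamic depth-range estimation, the last point above, can be sketched generically. The helper below is an illustration, not the paper's mechanism: given a coarse per-scene estimate of the nearest and farthest depths (e.g. from a monocular prior), it places log-spaced depth hypotheses across that range, so the same number of cost-volume planes covers a small indoor room or a deep outdoor scene.

```python
import numpy as np

def adaptive_depth_bins(d_min, d_max, n_bins=64):
    """Hypothetical sketch: log-spaced depth hypotheses spanning an
    estimated per-scene range, so cost-volume resolution adapts to
    scenes of very different scale (indoor vs. outdoor)."""
    # geometric spacing allocates more hypotheses to nearby depths,
    # where disparity changes fastest
    return np.geomspace(d_min, d_max, n_bins)
```

With fixed, hard-coded bins, an indoor model wastes most hypotheses beyond the walls of a room and an outdoor model under-samples distant geometry; re-deriving the range per scene avoids both failure modes.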