🤖 AI Summary
To address the limited accuracy and robustness of multi-view depth estimation under diverse camera configurations (e.g., varying relative poses and lens types), this paper proposes a plug-and-play iterative depth-hypothesis pruning framework. Given an arbitrary initial depth map, the method adaptively resamples a set of depth hypotheses and, as a novel contribution, applies contrastive learning in the multi-view depth-hypothesis space to learn scale- and configuration-invariant discriminative features. It further combines multi-view geometric constraints with adaptive metric-space mapping to robustly select the best hypothesis. On standard benchmarks, the approach significantly improves both depth and surface-normal accuracy, consistently outperforming state-of-the-art deep learning-based multi-view stereo methods.
📝 Abstract
We propose CHOSEN, a simple yet flexible, robust, and effective multi-view depth refinement framework. It can be employed in any existing multi-view stereo pipeline and generalizes readily across multi-view capture systems that differ in camera relative positioning and lenses. Given an initial depth estimate, CHOSEN iteratively re-samples and selects the best hypotheses, automatically adapting to the different metric or intrinsic scales determined by the capture system. The key to our approach is the application of contrastive learning in an appropriate solution space, together with a carefully designed hypothesis feature based on which positive and negative hypotheses can be effectively distinguished. Integrated into a simple baseline multi-view stereo pipeline, CHOSEN delivers impressive depth and normal accuracy compared to many current deep learning-based multi-view stereo pipelines.
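To make the iterate–resample–select loop concrete, here is a minimal sketch of that pattern, not the paper's actual implementation: `score_fn` is a hypothetical stand-in for CHOSEN's learned hypothesis scoring (contrastive feature matching plus geometric constraints), and the hypothesis-sampling scheme and shrink schedule are illustrative assumptions.

```python
import numpy as np

def refine_depth(depth, score_fn, num_hypotheses=5, iters=4, radius=0.5, shrink=0.5):
    """Iterative depth-hypothesis resampling and selection (illustrative sketch).

    depth:    (H, W) initial depth map from any source.
    score_fn: callable mapping a (K, H, W) stack of hypothesis depth maps to
              (K, H, W) per-pixel scores (higher = better) -- a placeholder for
              the learned matching score in the paper, not its real interface.
    """
    depth = depth.astype(np.float64).copy()
    for _ in range(iters):
        # Resample hypotheses in a band around the current estimate.
        offsets = np.linspace(-radius, radius, num_hypotheses)
        hyps = depth[None] + offsets[:, None, None]           # (K, H, W)
        # Score every hypothesis and keep the per-pixel winner.
        scores = score_fn(hyps)                               # (K, H, W)
        best = np.argmax(scores, axis=0)                      # (H, W)
        depth = np.take_along_axis(hyps, best[None], axis=0)[0]
        radius *= shrink                                      # tighten the search
    return depth
```

With a toy score that prefers hypotheses near a ground-truth depth, the loop pulls an offset initialization back toward the truth, illustrating how selection plus a shrinking search band refines the estimate.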