🤖 AI Summary
Current learning-based multi-view stereo (MVS) methods often neglect the geometric priors embedded in feature representations and correlation volumes, leading to insufficient robustness in cost volume matching. To address this, we propose a novel framework that explicitly models intra-view spatial coordinate dependencies and cross-view voxel contextual correlations. Our key contributions are: (1) the first joint modeling of intra-view spatial coordinate dependencies and cross-view voxel consistency guidance; (2) a lightweight cross-view aggregation module for efficient voxel-level correlation modeling; and (3) end-to-end differentiable depth regression. Evaluated on the DTU and Tanks and Temples benchmarks, our method achieves state-of-the-art performance while improving inference speed by 23% and reducing GPU memory consumption by 31%.
📝 Abstract
Multi-view Stereo (MVS) aims to estimate depth and reconstruct 3D point clouds from a series of overlapping images. Recent learning-based MVS frameworks often overlook the geometric information embedded in features and correlations, leading to weak cost matching. In this paper, we propose ICG-MVSNet, which explicitly integrates intra-view and cross-view relationships for depth estimation. Specifically, we develop an intra-view feature fusion module that leverages the feature coordinate correlations within a single image to improve the robustness of cost matching. Additionally, we introduce a lightweight cross-view aggregation module that efficiently exploits the contextual information in volume correlations to guide cost volume regularization. Our method is evaluated on the DTU dataset and the Tanks and Temples benchmark, consistently achieving performance competitive with state-of-the-art methods while requiring lower computational resources.
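For readers unfamiliar with the cost matching and regularization pipeline the abstract refers to, the sketch below shows the standard MVSNet-style backbone that ICG-MVSNet builds on: source-view features warped onto the reference camera's depth hypotheses are fused into a variance cost volume, and a softmax over depth hypotheses yields a regressed depth map. This is a generic illustration under that standard formulation, not the paper's intra-view or cross-view modules; the variance fusion, the toy feature tensors, and the DTU-style depth range are all assumptions for the example.

```python
import numpy as np

def variance_cost_volume(warped_feats):
    # warped_feats: (V, C, D, H, W) -- features from V views, assumed
    # already warped onto D fronto-parallel depth hypotheses of the
    # reference camera via plane-sweep homographies (warping omitted here).
    mean = warped_feats.mean(axis=0)                      # (C, D, H, W)
    return ((warped_feats - mean) ** 2).mean(axis=0)      # variance cost

def depth_regression(prob, depth_values):
    # prob: (D, H, W) softmax weights over hypotheses; depth_values: (D,)
    # Soft-argmax: expected depth under the matching distribution.
    return np.tensordot(depth_values, prob, axes=(0, 0))  # (H, W)

rng = np.random.default_rng(0)
V, C, D, H, W = 3, 8, 16, 4, 4                 # toy sizes for illustration
feats = rng.standard_normal((V, C, D, H, W)).astype(np.float32)

cost = variance_cost_volume(feats)             # (C, D, H, W)
score = -cost.mean(axis=0)                     # (D, H, W): low cost = good match
prob = np.exp(score - score.max(axis=0))       # numerically stable softmax
prob /= prob.sum(axis=0)
depths = np.linspace(425.0, 935.0, D)          # DTU-style hypothesis range (mm)
depth_map = depth_regression(prob, depths)     # (H, W) per-pixel depth
print(depth_map.shape)
```

In a full network, a 3D CNN regularizes the cost volume before the softmax; the paper's contribution is to inject intra-view coordinate correlations into feature fusion and cross-view volume context into that regularization step.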