Lightweight and Accurate Multi-View Stereo with Confidence-Aware Diffusion Model

📅 2025-09-18
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the trade-off between accuracy and efficiency in multi-view stereo (MVS) depth estimation, this paper introduces a framework that formulates MVS depth map refinement as a conditional diffusion process. The diffusion network combines a lightweight 2D U-Net with a convolutional Gated Recurrent Unit (ConvGRU) for iterative depth estimation; a dedicated condition encoder steers the diffusion process using multi-view geometric cues; and a confidence-based sampling strategy adaptively samples depth hypotheses according to the confidence estimated by the diffusion model. Two methods are built on this framework: DiffMVS achieves competitive accuracy with state-of-the-art efficiency in run-time and GPU memory, while CasDiffMVS achieves state-of-the-art performance on the DTU, Tanks & Temples, and ETH3D benchmarks.

📝 Abstract
To reconstruct 3D geometry from calibrated images, learning-based multi-view stereo (MVS) methods typically perform multi-view depth estimation and then fuse the depth maps into a mesh or point cloud. To improve computational efficiency, many methods initialize a coarse depth map and then gradually refine it at higher resolutions. Recently, diffusion models have achieved great success in generation tasks. Starting from random noise, diffusion models gradually recover the sample through an iterative denoising process. In this paper, we propose a novel MVS framework that introduces diffusion models into MVS. Specifically, we formulate depth refinement as a conditional diffusion process. Considering the discriminative nature of depth estimation, we design a condition encoder to guide the diffusion process. To improve efficiency, we propose a novel diffusion network combining a lightweight 2D U-Net and a convolutional GRU. Moreover, we propose a novel confidence-based sampling strategy that adaptively samples depth hypotheses based on the confidence estimated by the diffusion model. Based on this framework, we propose two MVS methods, DiffMVS and CasDiffMVS. DiffMVS achieves competitive performance with state-of-the-art efficiency in run-time and GPU memory. CasDiffMVS achieves state-of-the-art performance on DTU, Tanks & Temples and ETH3D. Code is available at: https://github.com/cvg/diffmvs.
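To make the idea of depth refinement as a conditional diffusion process more concrete, the following is a minimal sketch of a deterministic (DDIM-style) reverse process, not the authors' implementation. It assumes that the diffused variable is a residual added to the coarse depth, that the noise schedule is linear, and that the conditioning signal is just the coarse depth map; the abstract specifies none of these. The names `make_alpha_bars`, `refine_depth` and `oracle_noise` are hypothetical, and the toy oracle stands in for the trained, condition-guided denoising network.

```python
import numpy as np

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear noise schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def refine_depth(coarse_depth, predict_noise, alpha_bars, rng):
    """Deterministic DDIM-style reverse process over a depth residual.

    predict_noise(x_t, t, cond) is a hypothetical stand-in for the
    conditioned denoising network; it returns the noise estimate for x_t.
    """
    x = rng.standard_normal(coarse_depth.shape)  # start from random noise
    for t in range(len(alpha_bars) - 1, -1, -1):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps = predict_noise(x, t, coarse_depth)
        # Clean residual implied by the current sample and the noise estimate.
        x0 = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        # Move to the previous (less noisy) step without adding fresh noise.
        x = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps
    return coarse_depth + x

# Toy demonstration with synthetic data.
rng = np.random.default_rng(0)
true_depth = rng.uniform(2.0, 5.0, size=(4, 4))
coarse_depth = true_depth + rng.normal(0.0, 0.1, size=(4, 4))
alpha_bars = make_alpha_bars(1000)

def oracle_noise(x_t, t, cond):
    # Oracle that knows the true residual; a trained network would replace it.
    residual = true_depth - cond
    return (x_t - np.sqrt(alpha_bars[t]) * residual) / np.sqrt(1.0 - alpha_bars[t])

refined = refine_depth(coarse_depth, oracle_noise, alpha_bars, rng)
```

With a perfect noise estimate the loop recovers the true depth, which is only a sanity check on the update rule; in practice the quality of the refinement depends entirely on the learned denoiser and its conditioning.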
Problem

Research questions and friction points this paper is trying to address.

Improves multi-view stereo 3D reconstruction accuracy
Enhances computational efficiency in depth estimation
Introduces confidence-aware diffusion for depth refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion model for depth refinement
Lightweight 2D U-Net with GRU
Confidence-based adaptive sampling strategy
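One way to picture the confidence-based sampling strategy is sketched below: depth hypotheses are placed around the current estimate, and the search range narrows where the confidence is high. This is an illustrative assumption rather than the paper's formula. The linear range rule, uniform spacing in depth, and the names `sample_depth_hypotheses`, `base_range` and `min_scale` are all hypothetical.

```python
import numpy as np

def sample_depth_hypotheses(depth, confidence, num_hypotheses=4,
                            base_range=0.2, min_scale=0.25):
    """Per-pixel depth hypotheses centred on the current depth estimate.

    The half-width of the search range shrinks linearly from base_range
    (confidence 0) to base_range * min_scale (confidence 1). This rule is
    an assumption made for illustration.
    """
    confidence = np.clip(confidence, 0.0, 1.0)
    half_range = base_range * (min_scale + (1.0 - min_scale) * (1.0 - confidence))
    # Normalised offsets in [-1, 1], shared by every pixel.
    offsets = np.linspace(-1.0, 1.0, num_hypotheses)
    # Broadcast to shape (num_hypotheses, H, W).
    return depth[None] + offsets[:, None, None] * half_range[None]

# One low-confidence pixel and one high-confidence pixel at depth 3.0.
depth = np.full((1, 2), 3.0)
confidence = np.array([[0.0, 1.0]])
hypotheses = sample_depth_hypotheses(depth, confidence)
```

In this toy setup the low-confidence pixel is searched over a range four times wider than the high-confidence pixel, so the same number of hypotheses gives finer depth resolution where the estimate is already trusted.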