🤖 AI Summary
This study addresses the trade-offs between accuracy, coverage, robustness, and efficiency in multi-view stereo (MVS) methods for aerial 3D reconstruction. Traditional pipelines such as COLMAP achieve high geometric fidelity but are computationally expensive and scale poorly, while learning-based MVS methods have lacked systematic evaluation in real-world aerial scenarios. This work presents a comparative benchmark of COLMAP against state-of-the-art learning-based methods (including MVSNet, PatchmatchNet, MVSFormer++, DUSt3R, and VGGT) on two real aerial scenarios. Results show that COLMAP excels in geometric consistency but is computationally expensive, whereas learning-based methods are more robust when traditional image registration fails. Notably, DUSt3R and VGGT achieve competitive accuracy at substantially higher speed, though they still exhibit larger residuals in challenging scenes. These findings provide empirical guidance for selecting and optimizing MVS pipelines in photogrammetric applications.
📝 Abstract
Photogrammetric 3D reconstruction has long relied on traditional Structure-from-Motion (SfM) and Multi-View Stereo (MVS) methods, which provide high accuracy but face challenges in speed and scalability. Recently, learning-based MVS methods have emerged, aiming for faster and more efficient reconstruction. This work presents a comparative evaluation of a representative traditional MVS pipeline (COLMAP) and state-of-the-art learning-based approaches, including geometry-guided methods (MVSNet, PatchmatchNet, MVSAnywhere, MVSFormer++) and end-to-end frameworks (Stereo4D, FoundationStereo, DUSt3R, MASt3R, Fast3R, VGGT). Two experiments were conducted on different aerial scenarios. The first used the MARS-LVIG dataset, with LiDAR point clouds serving as the ground-truth 3D geometry. The second used a public scene from the official Pix4D website, with ground truth generated by Pix4Dmapper. We evaluated accuracy, coverage, and runtime for all methods. Experimental results show that COLMAP provides reliable and geometrically consistent reconstructions but requires more computation time. Where traditional methods fail at image registration, learning-based approaches exhibit stronger feature-matching capability and greater robustness. Geometry-guided methods usually require careful dataset preparation and often depend on camera poses or depth priors generated by COLMAP. End-to-end methods such as DUSt3R and VGGT achieve competitive accuracy and reasonable coverage while offering substantially faster reconstruction. However, they exhibit relatively large residuals in 3D reconstruction, particularly in challenging scenarios.
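
The abstract names accuracy and coverage as evaluation criteria but does not define them. As a minimal sketch, assuming the common point-cloud convention (accuracy as the mean distance from each reconstructed point to its nearest ground-truth point, coverage as the fraction of ground-truth points that have a reconstructed point within a threshold), the two quantities could be computed as below. The function names, the threshold `tau`, and the toy point sets are hypothetical and are not taken from the paper.

```python
import math

def nearest_distance(p, cloud):
    """Distance from point p to its nearest neighbour in cloud (brute force)."""
    return min(math.dist(p, q) for q in cloud)

def accuracy(recon, truth):
    """Assumed definition: mean reconstruction-to-truth distance (lower is better)."""
    return sum(nearest_distance(p, truth) for p in recon) / len(recon)

def coverage(recon, truth, tau):
    """Assumed definition: share of truth points with a reconstructed point within tau."""
    hits = sum(1 for q in truth if nearest_distance(q, recon) <= tau)
    return hits / len(truth)

# Toy data: four ground-truth points on a line, two reconstructed points
# that sit 0.1 units above the first two of them.
truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
recon = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1)]

print("accuracy:", accuracy(recon, truth))
print("coverage:", coverage(recon, truth, tau=0.5))
```

The toy case illustrates why both measures are reported together: a reconstruction can lie close to the reference wherever it exists and still miss half of the scene. The brute-force search is quadratic and only suits small examples; real aerial point clouds would need a spatial index such as a k-d tree.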