🤖 AI Summary
Traditional photogrammetric pipelines (e.g., COLMAP) struggle with sparse, unordered aerial image blocks, particularly when the image set is very small (fewer than 10 images), overlap is low, and structural constraints are absent. The result is poor pose estimation accuracy and incomplete dense point clouds.
Method: This work presents the first systematic evaluation of three end-to-end Transformer-based 3D reconstruction models—DUSt3R, MASt3R, and VGGT—on such challenging aerial imagery, benchmarking them against COLMAP.
Contribution/Results: On extremely sparse aerial data, all three models reconstruct more complete dense point clouds than COLMAP, with completeness gains of up to 50%. VGGT additionally shows higher computational efficiency, better scalability, and more reliable camera pose estimation. All three models degrade on high-resolution images and large image sets, where pose reliability declines as the number of images and the geometric complexity grow. The study thus delineates the applicability boundaries of foundational vision models in aerial photogrammetry: Transformer-based methods cannot fully replace traditional photogrammetric pipelines, but they are a promising complement for rapid 3D modeling from small, sparse, or low-resolution aerial datasets.
📝 Abstract
State-of-the-art 3D computer vision algorithms continue to advance in handling sparse, unordered image sets. Recently developed foundational models for 3D reconstruction, such as Dense and Unconstrained Stereo 3D Reconstruction (DUSt3R), Matching and Stereo 3D Reconstruction (MASt3R), and Visual Geometry Grounded Transformer (VGGT), have attracted attention for their ability to handle very sparse image overlaps. Evaluating DUSt3R, MASt3R, and VGGT on typical aerial images matters because these models may cope with extremely low image overlap, stereo occlusions, and textureless regions. For redundant collections, they can accelerate 3D reconstruction by operating on extremely sparsified image sets. Although they have been tested on various computer vision benchmarks, their potential on photogrammetric aerial blocks remains unexplored. This paper conducts a comprehensive evaluation of the pre-trained DUSt3R, MASt3R, and VGGT models on the aerial blocks of the UseGeo dataset for pose estimation and dense 3D reconstruction. Results show that these methods can accurately reconstruct dense point clouds from very sparse image sets (fewer than 10 images, at resolutions of up to 518 pixels), with completeness gains of up to 50% over COLMAP. VGGT also demonstrates higher computational efficiency, better scalability, and more reliable camera pose estimation. However, all three models show limitations with high-resolution images and large image sets: pose reliability declines as the number of images and the geometric complexity increase. These findings suggest that transformer-based methods cannot fully replace traditional Structure from Motion (SfM) and Multi-View Stereo (MVS), but they offer promise as complementary approaches, especially in challenging, low-resolution, and sparse scenarios.