🤖 AI Summary
This work addresses the challenge of achieving real-time, metrically consistent, high-precision depth reconstruction from ultra-high-resolution drone imagery under wide baselines, low-texture scenes, specular reflections, occlusions, and stringent computational constraints. To this end, it introduces incremental cluster-based bundle adjustment (BA) into a zero-shot diffusion-based depth estimation framework for the first time. By constructing overlapping frame clusters and periodically optimizing camera poses together with sparse 3D points, the method uses reprojected depth maps as metric guidance, enabling training-free, temporally consistent depth estimation across sequences. Evaluated at a flight altitude of approximately 50 meters, the approach achieves sub-meter accuracy (0.87 m horizontally, 0.12 m vertically) with per-frame processing times of 1.47 to 4.91 seconds, balancing real-time performance with photogrammetric-grade precision.
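The summary's "overlapping frame clusters" step can be sketched as a simple sliding-window grouping of streamed frame indices, where each cluster shares a few frames with its predecessor so periodic BA stays consistent across cluster boundaries. The cluster size and overlap below are illustrative assumptions, not parameters reported in the paper:

```python
def make_clusters(num_frames, cluster_size=8, overlap=2):
    """Group streamed frame indices into overlapping clusters.

    Each cluster shares `overlap` frames with its predecessor, so that
    bundle adjustment run periodically over one cluster remains
    consistent with the poses and tie-points of the previous cluster.
    (Illustrative sketch; the paper's actual clustering policy may differ.)
    """
    step = cluster_size - overlap
    clusters, start = [], 0
    while start < num_frames:
        clusters.append(list(range(start, min(start + cluster_size, num_frames))))
        if start + cluster_size >= num_frames:
            break  # final cluster reaches the end of the stream
        start += step
    return clusters
```

For example, `make_clusters(10, cluster_size=4, overlap=1)` yields `[[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]`, with one shared frame linking each consecutive pair of clusters.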
📝 Abstract
Real-time depth reconstruction from ultra-high-resolution UAV imagery is essential for time-critical geospatial tasks such as disaster response, yet it remains challenging due to wide-baseline parallax, large image sizes, low-texture or specular surfaces, occlusions, and strict computational constraints. Recent zero-shot diffusion models offer fast per-image dense predictions without task-specific retraining, require fewer labeled datasets than transformer-based predictors, and avoid the rigid capture-geometry requirements of classical multi-view stereo. However, their probabilistic inference prevents reliable metric accuracy and temporal consistency across sequential frames and overlapping tiles. We present ZeD-MAP, a cluster-level framework that converts a test-time diffusion depth model into a metrically consistent, SLAM-like mapping pipeline by integrating incremental cluster-based bundle adjustment (BA). Streamed UAV frames are grouped into overlapping clusters; periodic BA produces metrically consistent poses and sparse 3D tie-points, which are reprojected into selected frames and used as metric guidance for diffusion-based depth estimation. Validation on ground-marker flights captured at approximately 50 m altitude (GSD of approximately 0.85 cm/px, corresponding to roughly 2,650 m² of ground coverage per frame) with the DLR Modular Aerial Camera System (MACS) shows that our method achieves sub-meter accuracy, with approximately 0.87 m error in the horizontal (XY) plane and 0.12 m in the vertical (Z) direction, while maintaining per-image runtimes between 1.47 and 4.91 seconds. Results are subject to minor noise from manual point-cloud annotation. These findings show that BA-based metric guidance provides consistency comparable to classical photogrammetric methods while significantly accelerating processing, enabling real-time 3D map generation.
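One common way to turn sparse BA tie-points into "metric guidance" for a relative depth map is a global scale-and-shift fit: sample the diffusion model's depth at the reprojected tie-point pixels, solve a least-squares alignment against their metric depths, and apply it densely. This is a minimal sketch of that standard technique, not necessarily the guidance mechanism ZeD-MAP itself uses; all names below are hypothetical:

```python
import numpy as np

def align_relative_depth(rel_depth, uv, metric_depth):
    """Fit a global scale/shift mapping a relative depth map onto metric
    tie-point depths via least squares, then apply it to the whole map.

    rel_depth    : (H, W) relative depth predicted by the diffusion model
    uv           : (N, 2) integer pixel coords of reprojected BA tie-points
    metric_depth : (N,) metric depths of those tie-points in the camera frame

    Illustrative sketch; the paper's actual guidance may differ.
    """
    d_rel = rel_depth[uv[:, 1], uv[:, 0]]              # sample at tie-point pixels
    A = np.stack([d_rel, np.ones_like(d_rel)], axis=1)  # columns: [depth, 1]
    (s, t), *_ = np.linalg.lstsq(A, metric_depth, rcond=None)
    return s * rel_depth + t                            # metrically scaled map
```

With enough well-distributed tie-points per frame, this per-frame fit is what keeps consecutive depth maps on a shared metric scale; richer variants fit the alignment per region or feed the sparse depths into the diffusion sampler as conditioning.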