AI Summary
To address the need for real-time, accurate scene understanding in low-altitude unstructured environments for autonomous UAV navigation, this paper proposes a lightweight end-to-end joint learning framework that simultaneously performs semantic segmentation and monocular depth estimation from aerial imagery. The architecture employs a shared encoder with task-specific decoders, integrating multi-scale feature fusion and cross-task attention mechanisms to balance accuracy and efficiency. Evaluated on the MidAir and AeroScapes benchmarks, our method achieves state-of-the-art accuracy while maintaining an inference speed of 20.2 FPS and low GPU memory consumption. It outperforms both single-task baselines and existing joint-learning approaches in both accuracy and computational efficiency. The source code is publicly available.
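The shared-encoder, dual-decoder design described above can be illustrated with a minimal PyTorch sketch. This is a hypothetical toy model, not the authors' implementation: the layer sizes, module names (`SharedEncoder`, `JointNet`), and class count are illustrative assumptions, and the multi-scale fusion and cross-task attention modules are omitted for brevity.

```python
# Hypothetical sketch of a joint architecture: one shared encoder feeding
# two task-specific decoders (semantic segmentation + monocular depth).
# Sizes and names are illustrative, not the paper's actual implementation.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Downsamples the input image into a shared feature map (stride 4)."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Upsamples shared features back to input resolution for one task."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(in_ch, in_ch // 2, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(in_ch // 2, out_ch, 4, stride=2, padding=1),
        )

    def forward(self, f):
        return self.net(f)

class JointNet(nn.Module):
    def __init__(self, num_classes=8):
        super().__init__()
        self.encoder = SharedEncoder()
        self.seg_head = Decoder(64, num_classes)  # per-pixel class logits
        self.depth_head = Decoder(64, 1)          # per-pixel depth value

    def forward(self, x):
        f = self.encoder(x)  # features computed once, shared by both tasks
        return self.seg_head(f), self.depth_head(f)

x = torch.randn(1, 3, 64, 64)
seg, depth = JointNet()(x)
print(seg.shape, depth.shape)
```

Sharing the encoder is what gives joint methods their efficiency edge over single-task baselines: the most expensive feature extraction runs once per frame, while each lightweight decoder adds only a small per-task cost.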
Abstract
Understanding the geometric and semantic properties of a scene is crucial for autonomous navigation, and particularly challenging in the case of Unmanned Aerial Vehicle (UAV) navigation. Such information may be obtained by estimating depth and semantic segmentation maps of the surrounding environment; for practical use in autonomous navigation, the estimation must be performed as close to real time as possible. In this paper, we leverage monocular cameras on aerial robots to predict depth and semantic maps in low-altitude unstructured environments. We propose a joint deep-learning architecture that performs the two tasks accurately and rapidly, and validate its effectiveness on the MidAir and AeroScapes benchmark datasets. Our joint architecture is competitive with or superior to other single-task and joint methods while running fast, predicting at 20.2 FPS on a single NVIDIA Quadro P5000 GPU, and it has a low memory footprint. The code for training and prediction is available at: https://github.com/Malga-Vision/Co-SemDepth