🤖 AI Summary
To address the degraded detection performance in aerial imagery caused by small object sizes, high object density, severe blur, and occlusion, this paper proposes an end-to-end detection framework that integrates super-resolution enhancement with a lightweight YOLOv5 architecture. A Transformer encoder is embedded into the YOLOv5 backbone to jointly model long-range dependencies and global contextual information alongside super-resolution preprocessing, effectively alleviating the information bottleneck for small objects. The framework is trained and validated on multiple large-scale aerial benchmark datasets, including VisDrone-2023 and SeaDroneSee, achieving a mean Average Precision (mAP) of 52.5% and surpassing current state-of-the-art methods. Designed for efficiency, the model maintains high accuracy while enabling real-time inference, making it well-suited for edge deployment on resource-constrained UAV platforms.
📝 Abstract
The demand for accurate object detection in aerial imagery has surged with the widespread use of drones and satellite technology. Traditional object detection models, trained on datasets biased towards large objects, struggle to perform optimally in aerial scenarios where small, densely clustered objects are prevalent. To address this challenge, we present an approach that combines super-resolution with an adapted lightweight YOLOv5 architecture. We employ a range of datasets, including VisDrone-2023, SeaDroneSee, VEDAI, and NWPU VHR-10, to evaluate our model's performance. Our Super Resolved YOLOv5 architecture features Transformer encoder blocks, allowing the model to capture global context and long-range dependencies, leading to improved detection results, especially under high-density, occluded conditions. The lightweight model not only delivers improved accuracy but also ensures efficient resource utilization, making it well-suited for real-time applications. Our experimental results demonstrate the model's superior performance in detecting small and densely clustered objects, underlining the significance of dataset choice and architectural adaptation for this task. In particular, the method achieves 52.5% mAP on VisDrone, exceeding prior state-of-the-art results. This approach promises to significantly advance object detection in aerial imagery, contributing to more accurate and reliable results in a variety of real-world applications.
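To illustrate the idea of the Transformer encoder blocks mentioned above, the following is a minimal, generic sketch (not the paper's actual implementation): a single-head self-attention encoder block applied to a backbone feature map whose spatial positions are flattened into tokens, so every cell can attend to every other cell and aggregate global context. All function and weight names here are hypothetical, and layer normalization is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (tokens, d) flattened feature map; single-head scaled dot-product attention.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = (Q @ K.T) / np.sqrt(K.shape[-1])  # (tokens, tokens) affinity matrix
    return softmax(scores) @ V                 # global mixing across all positions

def encoder_block(X, Wq, Wk, Wv, W1, b1, W2, b2):
    # Attention + residual, then a ReLU feed-forward sublayer + residual
    # (layer norm omitted to keep the sketch short).
    A = X + self_attention(X, Wq, Wk, Wv)
    H = np.maximum(A @ W1 + b1, 0)
    return A + H @ W2 + b2

if __name__ == "__main__":
    # A C=8 feature map of 4x4 spatial cells, flattened to 16 tokens.
    rng = np.random.default_rng(0)
    d, tokens, hidden = 8, 16, 32
    X = rng.standard_normal((tokens, d))
    Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
    W1 = rng.standard_normal((d, hidden)) * 0.1
    W2 = rng.standard_normal((hidden, d)) * 0.1
    Y = encoder_block(X, Wq, Wk, Wv, W1, np.zeros(hidden), W2, np.zeros(d))
    print(Y.shape)  # same shape as the input, so it can slot into a CNN backbone
```

Because the block preserves the token/channel shape, it can in principle replace or follow a convolutional stage in a detection backbone; the paper's actual placement and dimensions within YOLOv5 are not specified here.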