🤖 AI Summary
This work addresses small object detection in drone imagery, which is hindered mainly by insufficient feature representation and inefficient multi-scale fusion. To overcome these limitations, we propose the EFSI-DETR framework, which integrates a Dynamic Frequency-Spatial Unified Synergy Network (DyFusNet), an Efficient Semantic Feature Concentrator (ESFC), and a Fine-grained Feature Retention (FFR) strategy. This design combines frequency-domain, spatial, and semantic information while balancing detection accuracy and inference efficiency. Evaluated on the VisDrone and CODrone datasets, our method achieves state-of-the-art performance, improving AP by 1.6% and small-object AP_s by 5.8% on VisDrone. It also runs in real time at 188 FPS on a single RTX 4090 GPU.
📝 Abstract
Real-time small object detection in Unmanned Aerial Vehicle (UAV) imagery remains challenging due to limited feature representation and ineffective multi-scale fusion. Existing methods underutilize frequency information and rely on static convolutional operations, which limits the richness of the learned feature representations and hinders the effective exploitation of deep semantic features. To address these issues, we propose EFSI-DETR, a novel detection framework that integrates efficient semantic feature enhancement with dynamic frequency-spatial guidance. EFSI-DETR comprises two main components: (1) a Dynamic Frequency-Spatial Unified Synergy Network (DyFusNet) that jointly exploits frequency and spatial cues for robust multi-scale feature fusion, and (2) an Efficient Semantic Feature Concentrator (ESFC) that enables deep semantic extraction with minimal computational cost. Furthermore, a Fine-grained Feature Retention (FFR) strategy incorporates spatially rich shallow features during fusion to preserve the fine-grained details that are crucial for small object detection in UAV imagery. Extensive experiments on the VisDrone and CODrone benchmarks demonstrate that EFSI-DETR achieves state-of-the-art performance with real-time efficiency, yielding improvements of 1.6% and 5.8% in AP and AP_s on VisDrone, while reaching an inference speed of 188 FPS on a single RTX 4090 GPU.
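To make the small-object metric concrete, the sketch below shows how boxes are commonly grouped by size in COCO-style evaluation, where "small" means an area under 32×32 pixels. It assumes that the reported AP_s follows this convention, which the abstract does not state explicitly. The function names `size_bucket` and `small_fraction` are hypothetical, and box area (width × height) is used here as a simplification.

```python
from __future__ import annotations

# Illustrative only: COCO-style size buckets, assumed (not confirmed by the
# abstract) to underlie the AP_s metric reported above.
SMALL_MAX_AREA = 32 ** 2    # "small": area below 32^2 pixels
MEDIUM_MAX_AREA = 96 ** 2   # "medium": area from 32^2 up to (not including) 96^2


def size_bucket(width: float, height: float) -> str:
    """Return the size bucket for a box with the given width and height in pixels."""
    area = width * height
    if area < SMALL_MAX_AREA:
        return "small"
    if area < MEDIUM_MAX_AREA:
        return "medium"
    return "large"


def small_fraction(boxes: list[tuple[float, float]]) -> float:
    """Fraction of (width, height) boxes that fall into the "small" bucket."""
    if not boxes:
        return 0.0
    return sum(size_bucket(w, h) == "small" for w, h in boxes) / len(boxes)


if __name__ == "__main__":
    # Hypothetical box sizes, not taken from VisDrone or CODrone.
    demo_boxes = [(12, 18), (25, 30), (40, 60), (120, 90)]
    for w, h in demo_boxes:
        print(f"{w}x{h}: {size_bucket(w, h)}")
    print("small fraction:", small_fraction(demo_boxes))
```

Under this convention, AP_s averages precision only over ground-truth objects in the "small" bucket, so a dataset dominated by such objects is especially sensitive to the loss of fine-grained detail that the FFR strategy is meant to prevent.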