🤖 AI Summary
To address performance bottlenecks in detecting tiny objects in high-resolution aerial imagery, which stem from weak global contextual awareness in shallow features and the loss of multi-scale details, this paper proposes FMC-DETR, a frequency-decoupled, multi-domain collaborative detection framework. Methodologically, it introduces (1) the Wavelet Kolmogorov-Arnold Transformer (WeKat), a novel backbone that combines cascaded wavelet decomposition with Kolmogorov-Arnold nonlinear representation learning; (2) a lightweight Cross-stage Partial Fusion (CPF) module that reduces redundancy in multi-scale feature interaction; and (3) a Multi-Domain Feature Coordination (MDFC) module that unifies spatial, frequency, and structural priors to balance low-frequency semantic enhancement against high-frequency detail preservation. Evaluated on the VisDrone dataset, the method improves on the baseline by 6.5% AP and 8.2% AP₅₀ and reaches state-of-the-art performance with fewer parameters than competing approaches.
📝 Abstract
Aerial-view object detection is a critical technology for real-world applications such as natural resource monitoring, traffic management, and UAV-based search and rescue. Detecting tiny objects in high-resolution aerial imagery presents a long-standing challenge due to their limited visual cues and the difficulty of modeling global context in complex scenes. Existing methods are often hampered by delayed contextual fusion and inadequate non-linear modeling, failing to effectively use global information to refine shallow features and thus encountering a performance bottleneck. To address these challenges, we propose FMC-DETR, a novel framework with frequency-decoupled fusion for aerial-view object detection. First, we introduce the Wavelet Kolmogorov-Arnold Transformer (WeKat) backbone, which applies cascaded wavelet transforms to enhance global low-frequency context perception in shallow features while preserving fine-grained details, and employs Kolmogorov-Arnold networks to achieve adaptive non-linear modeling of multi-scale dependencies. Next, a lightweight Cross-stage Partial Fusion (CPF) module reduces redundancy and improves multi-scale feature interaction. Finally, we introduce the Multi-Domain Feature Coordination (MDFC) module, which unifies spatial, frequency, and structural priors to balance detail preservation and global enhancement. Extensive experiments on benchmark aerial-view datasets demonstrate that FMC-DETR achieves state-of-the-art performance with fewer parameters. On the challenging VisDrone dataset, our model achieves improvements of 6.5% AP and 8.2% AP50 over the baseline, highlighting its effectiveness in tiny object detection. The code can be accessed at https://github.com/bloomingvision/FMC-DETR.
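As a rough illustration of what "cascaded wavelet transforms" means for a feature map, the sketch below applies a single-level 2D Haar transform repeatedly to the low-frequency band. It is not the WeKat implementation: the choice of the Haar wavelet, the NumPy-only formulation, and the function names `haar_dwt2`, `haar_idwt2`, and `cascade` are all assumptions made for the example.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level orthonormal 2D Haar transform of an (H, W) array (H, W even)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]   # bottom-left, bottom-right
    ll = (a + b + c + d) / 2.0            # low-frequency (scaled block average)
    lh = (a + b - c - d) / 2.0            # difference between the two rows
    hl = (a - b + c - d) / 2.0            # difference between the two columns
    hh = (a - b - c + d) / 2.0            # diagonal difference
    return ll, (lh, hl, hh)

def haar_idwt2(ll, highs):
    """Exact inverse of haar_dwt2."""
    lh, hl, hh = highs
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

def cascade(x, levels):
    """Re-apply the transform to the low-frequency band `levels` times.

    Returns the coarsest low-frequency band and the high-frequency
    bands of every level, ordered from finest to coarsest.
    """
    details = []
    ll = x
    for _ in range(levels):
        ll, highs = haar_dwt2(ll)
        details.append(highs)
    return ll, details

# Hypothetical 8x8 single-channel feature map.
feat = np.random.default_rng(0).standard_normal((8, 8))
low, details = cascade(feat, levels=2)

# Invert the cascade from coarsest to finest to recover the input.
recon = low
for highs in reversed(details):
    recon = haar_idwt2(recon, highs)
```

Each level halves the spatial resolution of the low-frequency band, so every coefficient in it summarizes a progressively larger region of the input, while the per-level detail bands keep the information needed for exact reconstruction. This is the general property that lets a wavelet cascade widen context without discarding fine detail.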