🤖 AI Summary
To address weak feature representation and inefficient cross-scale fusion in object detection for high-resolution satellite imagery, this paper proposes GLOD, an architecture that replaces the conventional CNN backbone with a Swin Transformer to improve long-range dependency modeling. It introduces an UpConvMixer upsampling module and a multi-scale Fusion Block for feature reconstruction and fusion, adopts an asymmetric cross-layer fusion strategy that incorporates CBAM attention, and uses a multi-path detection head to strengthen multi-scale object representation. The design aims to make better use of the spatial priors inherent in satellite imagery. Evaluated on the xView dataset, GLOD achieves 32.95% mAP, surpassing the previous state of the art by 11.46%, while offering a favorable trade-off between detection accuracy and computational efficiency.
📝 Abstract
We present GLOD, a transformer-first architecture for object detection in high-resolution satellite imagery. GLOD replaces CNN backbones with a Swin Transformer for end-to-end feature extraction, combined with novel UpConvMixer blocks for robust upsampling and Fusion Blocks for multi-scale feature integration. Our approach achieves 32.95% mAP on xView, outperforming state-of-the-art methods by 11.46%. Key innovations include asymmetric fusion with CBAM attention and a multi-path head design that captures objects across scales. The architecture is optimized for the challenges of satellite imagery, leveraging spatial priors while maintaining computational efficiency.
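To make the multi-scale part of this design concrete, the sketch below computes the feature-map shapes that a standard hierarchical Swin Transformer backbone produces at each stage. These are the tensors that upsampling and fusion modules would then combine. The abstract does not state which Swin variant or input size GLOD uses, so the Swin-T defaults (`embed_dim=96`, `patch_size=4`, four stages) and the 512x512 input are assumptions, and the function name `swin_stage_shapes` is hypothetical.

```python
from typing import List, Tuple


def swin_stage_shapes(
    height: int,
    width: int,
    embed_dim: int = 96,   # assumption: Swin-T width; GLOD's variant is not stated
    patch_size: int = 4,   # standard Swin patch size
    num_stages: int = 4,   # standard Swin stage count
) -> List[Tuple[int, int, int]]:
    """Return (channels, H, W) for each stage of a hierarchical Swin backbone.

    Stage i has total stride patch_size * 2**i and embed_dim * 2**i channels,
    because each patch-merging layer halves resolution and doubles channels.
    """
    total_stride = patch_size * 2 ** (num_stages - 1)
    if height % total_stride or width % total_stride:
        raise ValueError(f"input size must be divisible by {total_stride}")

    shapes = []
    for stage in range(num_stages):
        stride = patch_size * 2 ** stage
        channels = embed_dim * 2 ** stage
        shapes.append((channels, height // stride, width // stride))
    return shapes


# Illustrative input size only; the abstract does not give GLOD's tile size.
for stage, (c, h, w) in enumerate(swin_stage_shapes(512, 512), start=1):
    print(f"stage {stage}: {c} channels at {h}x{w}")
```

Under these assumptions the deepest stage is 32 times coarser than the input, which is why small objects in satellite imagery depend on fusing it with the higher-resolution early stages.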