Combining Transformers and CNNs for Efficient Object Detection in High-Resolution Satellite Imagery

📅 2025-07-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address weak feature representation and inefficient cross-scale fusion in object detection for high-resolution satellite imagery, this paper proposes GLOD, a novel architecture that replaces the conventional CNN backbone with a Swin Transformer to improve long-range dependency modeling. GLOD introduces an UpConvMixer upsampling module and a multi-scale Fusion Block for efficient feature reconstruction and fusion, adopts an asymmetric cross-layer fusion strategy built around CBAM attention, and employs a multi-path detection head to strengthen multi-scale object representation. The design better exploits the spatial priors inherent in satellite imagery. Evaluated on the xView dataset, GLOD achieves 32.95% mAP, surpassing the previous state of the art by 11.46%, while offering a favorable trade-off between detection accuracy and computational efficiency.

📝 Abstract
We present GLOD, a transformer-first architecture for object detection in high-resolution satellite imagery. GLOD replaces CNN backbones with a Swin Transformer for end-to-end feature extraction, combined with novel UpConvMixer blocks for robust upsampling and Fusion Blocks for multi-scale feature integration. Our approach achieves 32.95% mAP on xView, outperforming SOTA methods by 11.46%. Key innovations include asymmetric fusion with CBAM attention and a multi-path head design capturing objects across scales. The architecture is optimized for satellite imagery challenges, leveraging spatial priors while maintaining computational efficiency.
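The CBAM attention used in GLOD's fusion stage is a previously published module (channel attention followed by spatial attention). As a reference point, here is a minimal PyTorch sketch of standard CBAM; it is not the paper's code, and layer choices such as the reduction ratio and kernel size are common defaults, not values from GLOD.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention
    (shared MLP over avg- and max-pooled descriptors), then spatial
    attention (conv over channel-wise avg and max maps)."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size,
                                 padding=kernel_size // 2, bias=False)

    def forward(self, x):
        # Channel gate: sigmoid over summed avg-pool and max-pool branches.
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial gate: conv over stacked per-pixel channel statistics.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

feat = torch.randn(1, 64, 32, 32)
out = CBAM(64)(feat)
print(out.shape)  # torch.Size([1, 64, 32, 32])
```

Both gates are multiplicative, so CBAM preserves the input's shape and can be dropped into any point of a fusion pathway.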
Problem

Research questions and friction points this paper is trying to address.

Improving object detection in high-resolution satellite imagery
Combining Transformers and CNNs for efficient feature extraction
Addressing multi-scale object detection with novel fusion blocks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Swin Transformer replaces CNN backbones
UpConvMixer blocks for robust upsampling
Asymmetric fusion with CBAM attention
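The paper does not include code, but the cross-scale fusion idea listed above can be sketched generically: a deep, low-resolution feature map is upsampled and merged with a shallower, higher-resolution one. The class name, channel sizes, and layer choices below are illustrative assumptions, not GLOD's actual Fusion Block.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Hypothetical cross-scale fusion sketch: upsample the deep map,
    concatenate with the shallow map, and mix with a 3x3 conv."""
    def __init__(self, deep_ch, shallow_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse = nn.Sequential(
            nn.Conv2d(deep_ch + shallow_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, deep, shallow):
        # Align resolutions, then fuse along the channel dimension.
        return self.fuse(torch.cat([self.up(deep), shallow], dim=1))

deep = torch.randn(1, 256, 16, 16)     # e.g. a late backbone stage
shallow = torch.randn(1, 128, 32, 32)  # an earlier, higher-resolution stage
fused = FusionBlock(256, 128, 128)(deep, shallow)
print(fused.shape)  # torch.Size([1, 128, 32, 32])
```

In an asymmetric scheme like GLOD's, the two inputs would not be treated identically; for example, attention such as CBAM could be applied to only one branch before fusion.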
Nicolas Drapier
L2TI Laboratory, Institut Galilée, Université Sorbonne Paris Nord, SAS Impact
Aladine Chetouani
Institut Galilée - L2TI - Multimedia Team
Image Quality Assessment · Video Analysis · Deep Learning · Pattern Recognition
Aurélien Chateigner
SAS Impact