🤖 AI Summary
This work addresses the challenges of salient object detection in optical remote sensing imagery, where large scale variations and complex backgrounds prevent existing methods from effectively modeling geometric structures and fine-grained details, often yielding incomplete detections. To overcome these limitations, the authors propose G2HFNet, a novel architecture that builds on a Swin Transformer backbone and, for the first time, introduces a geometry-granularity-aware mechanism. The model incorporates a hierarchical feature fusion framework comprising four key components: Multi-scale Detail Enhancement (MDE), Dual-branch Geometry-Granularity Complementary (DGC), Deep Semantic Perception (DSP), and Local-Global Guided Fusion (LGF). Extensive experiments show that G2HFNet significantly outperforms state-of-the-art methods across multiple remote sensing datasets, producing more complete and accurate saliency maps, particularly in complex scenes.
📝 Abstract
Remote sensing images captured from aerial perspectives often exhibit significant scale variations and complex backgrounds, posing challenges for salient object detection (SOD). Existing methods typically extract multi-level features at a single scale using uniform attention mechanisms, leading to suboptimal representations and incomplete detection results. To address these issues, we propose a GeoGran-Aware Hierarchical Feature Fusion Network (G2HFNet) that fully exploits geometric and granular cues in optical remote sensing images. Specifically, G2HFNet adopts Swin Transformer as the backbone to extract multi-level features and integrates three key modules: the multi-scale detail enhancement (MDE) module to handle object scale variations and enrich fine details, the dual-branch geo-gran complementary (DGC) module to jointly capture fine-grained details and positional information in mid-level features, and the deep semantic perception (DSP) module to refine high-level positional cues via self-attention. Additionally, a local-global guidance fusion (LGF) module is introduced to replace traditional convolutions for effective multi-level feature integration. Extensive experiments demonstrate that G2HFNet achieves high-quality saliency maps and significantly improves detection performance in challenging remote sensing scenarios.
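To make the data flow concrete, here is a minimal numpy sketch of the hierarchy the abstract describes: MDE on the shallowest feature level, DGC on the two mid levels, DSP (self-attention) on the deepest level, and LGF fusing levels top-down. Only the module names and their stated roles come from the abstract; every function body, the feature shapes, and the fusion order are illustrative assumptions, not the paper's actual layers.

```python
import numpy as np

def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample(x, size):
    """Nearest-neighbour upsample a (C, H, W) map to (C, size, size)."""
    idx = (np.arange(size) * x.shape[1]) // size
    return x[:, idx][:, :, idx]

def mde(x):
    """MDE sketch: mix the map with pooled-then-upsampled copies of itself
    so coarse context and fine detail share one representation."""
    C, H, W = x.shape
    out = x.copy()
    for k in (2, 4):
        pooled = x.reshape(C, H // k, k, W // k, k).mean(axis=(2, 4))
        out += upsample(pooled, H)
    return out / 3.0

def dgc(x):
    """DGC sketch: a gated detail branch plus a spatial-attention position
    branch, averaged so both kinds of cues survive."""
    C, H, W = x.shape
    detail = x * _sigmoid(x)                      # fine-grained gating branch
    logits = x.mean(axis=0).ravel()               # channel-pooled saliency logits
    w = np.exp(logits - logits.max())
    w /= w.sum()
    position = x * w.reshape(1, H, W) * (H * W)   # positional-attention branch
    return 0.5 * (detail + position)

def dsp(x):
    """DSP sketch: plain dot-product self-attention over the spatial
    positions of the deepest feature map."""
    C, H, W = x.shape
    tokens = x.reshape(C, H * W).T                # (N, C) spatial tokens
    scores = tokens @ tokens.T / np.sqrt(C)
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = scores / scores.sum(axis=-1, keepdims=True)
    return (attn @ tokens).T.reshape(C, H, W)

def lgf(local, deep):
    """LGF sketch: upsample the deeper map and let its global statistics
    gate the shallower one before summing (fusion without convolutions)."""
    g = upsample(deep, local.shape[1])
    gate = _sigmoid(deep.mean(axis=(1, 2), keepdims=True))
    return local * gate + g

# Toy 4-level feature pyramid standing in for Swin Transformer outputs.
rng = np.random.default_rng(0)
f1, f2, f3, f4 = (rng.standard_normal((16, s, s)) for s in (64, 32, 16, 8))

f1 = mde(f1)                        # low level: scale-aware detail enrichment
f2, f3 = dgc(f2), dgc(f3)           # mid levels: detail + position branches
f4 = dsp(f4)                        # high level: semantic self-attention

fused = lgf(f3, f4)                 # top-down local-global guided fusion
fused = lgf(f2, fused)
fused = lgf(f1, fused)
saliency = _sigmoid(fused.mean(axis=0))   # (64, 64) map with values in (0, 1)
```

The point of the sketch is the topology, not the operators: each level gets a level-appropriate refinement before a single top-down pass integrates them, which is what lets the deepest (positional) cues guide the shallowest (detail) ones.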