🤖 AI Summary
Remote sensing image scene classification is difficult because ground objects have complex spatial structures and vary widely in scale. To address these challenges, this paper proposes AFM-Net, an architecture that combines the multi-scale local prior modeling of convolutional neural networks (CNNs) with the efficient global sequence modeling of Mamba. A hierarchical dynamic fusion mechanism enables cross-level feature interaction and contextual reconstruction. A Mixture-of-Experts (MoE) classification module then routes features adaptively to improve fine-grained discriminability. In the reported experiments, AFM-Net achieves 93.72%, 95.54%, and 96.92% classification accuracy on the AID, NWPU-RESISC45, and UC Merced benchmarks, respectively, outperforming existing methods while balancing accuracy and computational efficiency.
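To make the idea of dynamically fusing two feature streams concrete, here is a minimal NumPy sketch of a gated blend between a CNN feature vector and a Mamba feature vector. It is an illustration under stated assumptions, not the fusion module from the paper: the function `gated_fusion`, the sigmoid gate, the weight names `w_gate` and `b_gate`, and all tensor shapes are hypothetical choices made for this example.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(cnn_feat, mamba_feat, w_gate, b_gate):
    """Blend two (batch, dim) feature arrays with a per-channel gate.

    Hypothetical sketch: the gate is computed from both inputs, so the
    mixing ratio changes with the content of each sample.
    """
    joint = np.concatenate([cnn_feat, mamba_feat], axis=-1)  # (batch, 2*dim)
    gate = sigmoid(joint @ w_gate + b_gate)                   # (batch, dim), values in (0, 1)
    # Convex combination: gate near 1 favors the CNN branch, near 0 the Mamba branch.
    return gate * cnn_feat + (1.0 - gate) * mamba_feat


# Random stand-ins for pooled branch features (assumed shapes).
rng = np.random.default_rng(0)
batch, dim = 4, 8
cnn_feat = rng.normal(size=(batch, dim))
mamba_feat = rng.normal(size=(batch, dim))
w_gate = rng.normal(scale=0.1, size=(2 * dim, dim))
b_gate = np.zeros(dim)

fused = gated_fusion(cnn_feat, mamba_feat, w_gate, b_gate)
print(fused.shape)
```

Because the gate is a convex weight, every fused value lies between the two branch values for that channel. A hierarchical variant would apply such a blend at several feature scales and pass the result on to the next level.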
📝 Abstract
Remote sensing image scene classification remains a challenging task, primarily due to the complex spatial structures and multi-scale characteristics of ground objects. Among existing approaches, CNNs excel at modeling local textures, while Transformers excel at capturing global context. However, integrating the two efficiently remains a bottleneck because of the high computational cost of Transformers. To tackle this, we propose AFM-Net, a novel Advanced Hierarchical Fusing framework that achieves effective local and global co-representation through two pathways: a CNN branch for extracting hierarchical visual priors, and a Mamba branch for efficient global sequence modeling. The core innovation of AFM-Net lies in its Hierarchical Fusion Mechanism, which progressively aggregates multi-scale features from both pathways, enabling dynamic cross-level feature interaction and contextual reconstruction to produce highly discriminative representations. These fused features are then adaptively routed through a Mixture-of-Experts classifier module, which dispatches them to the most suitable experts for fine-grained scene recognition. Experiments on AID, NWPU-RESISC45, and UC Merced show that AFM-Net obtains 93.72%, 95.54%, and 96.92% accuracy, respectively, surpassing state-of-the-art methods while balancing accuracy and efficiency. Code is available at https://github.com/tangyuanhao-qhu/AFM-Net.
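The routing step described above can be sketched with a generic top-k Mixture-of-Experts classifier. This is a minimal NumPy illustration, not the classifier from the AFM-Net repository: the function `moe_classify`, the linear experts, the choice of `top_k=2`, the number of experts, and all shapes are assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def moe_classify(feats, w_router, expert_ws, top_k=2):
    """Route each sample to its top_k experts and mix their class logits.

    feats:     (batch, dim) fused features
    w_router:  (dim, n_experts) router weights (hypothetical)
    expert_ws: (n_experts, dim, n_classes) one linear expert each (hypothetical)
    """
    router_logits = feats @ w_router                              # (batch, n_experts)
    top_idx = np.argsort(router_logits, axis=-1)[:, -top_k:]      # (batch, top_k)
    top_logits = np.take_along_axis(router_logits, top_idx, axis=-1)
    weights = softmax(top_logits, axis=-1)                        # renormalize over chosen experts

    # For brevity every expert is evaluated densely; a real implementation
    # would run only the selected experts to save compute.
    all_logits = np.einsum("bd,edc->bec", feats, expert_ws)       # (batch, n_experts, n_classes)
    chosen = np.take_along_axis(all_logits, top_idx[:, :, None], axis=1)
    class_logits = (weights[:, :, None] * chosen).sum(axis=1)     # (batch, n_classes)
    return softmax(class_logits, axis=-1), top_idx


rng = np.random.default_rng(0)
batch, dim, n_experts, n_classes = 4, 16, 4, 45  # 45 matches the NWPU-RESISC45 class count
feats = rng.normal(size=(batch, dim))
w_router = rng.normal(scale=0.1, size=(dim, n_experts))
expert_ws = rng.normal(scale=0.1, size=(n_experts, dim, n_classes))

probs, selected = moe_classify(feats, w_router, expert_ws)
print(probs.shape, selected.shape)
```

Renormalizing the router scores over only the selected experts keeps the mixture weights summing to one per sample, so the combined logits stay on the same scale regardless of how many experts exist in total.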