🤖 AI Summary
To address the limitations of fixed receptive fields in multiscale land-cover representation and redundant feature interference introduced by standard self-attention in hyperspectral image (HSI) classification, this paper proposes the Dual Selective Fusion Transformer Network (DSFormer), a spatial-spectral dual-path architecture. Its core contributions are: (1) a Kernel Selective Fusion Transformer Block that adaptively learns optimal convolutional kernel sizes to dynamically adjust the receptive field; and (2) a Token Selective Fusion Transformer Block that jointly models spatial-spectral token importance for weighted fusion of discriminative features. The model integrates multiscale convolutional perception, learnable receptive field selection, and joint spatial-spectral self-attention. Experiments on the PaviaU, Houston, Indian Pines, and WHU-HongHu datasets achieve overall accuracies of 96.59%, 97.66%, 95.17%, and 94.59%, respectively, averaging 2.01% higher than state-of-the-art methods.
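The "learnable receptive field selection" in contribution (1) can be illustrated with a minimal pure-Python sketch: features extracted by convolutional branches of different kernel sizes are fused with softmax gate weights, so the network effectively chooses its receptive field per input. All names here (`kernel_selective_fusion`, `gate_logits`) are hypothetical illustrations, not the paper's actual implementation:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def kernel_selective_fusion(branch_features, gate_logits):
    """Fuse multiscale branch features with learned softmax gates.

    branch_features: K feature vectors, one per kernel size (e.g. 3x3, 5x5, 7x7).
    gate_logits: K scalars from a small gating network (hypothetical);
    in training these would be learned, here they are plain inputs.
    """
    weights = softmax(gate_logits)
    fused = [0.0] * len(branch_features[0])
    for w, feat in zip(weights, branch_features):
        for i, v in enumerate(feat):
            fused[i] += w * v  # convex combination of the branches
    return fused
```

With equal logits the fusion reduces to a plain average of the branches; as one gate dominates, the output approaches the corresponding single-scale feature, which is the "selection" behavior.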
📝 Abstract
Transformers have achieved satisfactory results in hyperspectral image (HSI) classification. However, existing Transformer models face two key challenges when dealing with HSI scenes characterized by diverse land-cover types and rich spectral information: (1) a fixed receptive field overlooks the effective contextual scales required by different HSI objects; and (2) invalid self-attention features in context fusion degrade model performance. To address these limitations, we propose a novel Dual Selective Fusion Transformer Network (DSFormer) for HSI classification. DSFormer achieves joint spatial and spectral contextual modeling by flexibly selecting and fusing features across different receptive fields, effectively reducing unnecessary information interference by focusing on the most relevant spatial-spectral tokens. Specifically, we design a Kernel Selective Fusion Transformer Block (KSFTB) that learns an optimal receptive field by adaptively fusing spatial and spectral features across different scales, enhancing the model's ability to accurately identify diverse HSI objects. Additionally, we introduce a Token Selective Fusion Transformer Block (TSFTB), which strategically selects and combines essential tokens during spatial-spectral self-attention fusion to capture the most crucial contexts. Extensive experiments on four benchmark HSI datasets demonstrate that the proposed DSFormer significantly improves land-cover classification accuracy, outperforming existing state-of-the-art methods. Specifically, DSFormer achieves overall accuracies of 96.59%, 97.66%, 95.17%, and 94.59% on the Pavia University, Houston, Indian Pines, and WHU-HongHu datasets, respectively, improvements of 3.19%, 1.14%, 0.91%, and 2.80% over the previous best methods. The code will be available online at https://github.com/YichuXu/DSFormer.
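The token-selection idea behind TSFTB can be sketched as a variant of scaled dot-product attention that keeps only the highest-scoring tokens before normalization, so low-relevance ("invalid") tokens contribute nothing to the fused context. This is a minimal single-query, pure-Python illustration; the function name and the `keep_k` knob are assumptions for the sketch, not the paper's exact mechanism:

```python
import math

def token_selective_attention(query, keys, values, keep_k):
    """Attend only over the keep_k highest-scoring tokens.

    query: a single query vector; keys/values: per-token vectors.
    keep_k: number of tokens retained before softmax (hypothetical knob).
    """
    scale = math.sqrt(len(query))
    # Scaled dot-product relevance score for every token.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Keep indices of the keep_k largest scores; discard the rest entirely.
    kept = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep_k]
    # Softmax only over the kept tokens.
    m = max(scores[i] for i in kept)
    exps = {i: math.exp(scores[i] - m) for i in kept}
    z = sum(exps.values())
    # Weighted sum of the selected value vectors.
    out = [0.0] * len(values[0])
    for i, e in exps.items():
        for d, v in enumerate(values[i]):
            out[d] += (e / z) * v
    return out
```

With `keep_k` equal to the token count this reduces to standard attention; smaller values yield the selective fusion that suppresses irrelevant context.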