A Synergistic CNN-Transformer Network with Pooling Attention Fusion for Hyperspectral Image Classification

📅 2026-04-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two challenges in hyperspectral image classification: using spatial and spectral information jointly, and preserving critical features as they propagate through network layers. The authors propose a synergistic CNN-Transformer network in which CNNs and a Vision Transformer process spatial and spectral features separately. A twin-branch feature extraction module applies 3D and 2D convolutions in parallel to extract spectral and spatial features, a hybrid pooling attention module aggregates spatial attention, and a cascaded Transformer encoder extracts global spectral features. A cross-layer feature fusion module reduces the loss of crucial information from earlier layers. Experiments on several representative hyperspectral datasets show superior performance compared with state-of-the-art methods.
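As a rough illustration of the cross-layer feature fusion idea, the sketch below blends an earlier layer's feature vector into a deeper one so that shallow information is not discarded. The summary does not say how the fusion is computed, so the weighted sum, the function name `cross_layer_fusion`, and the `alpha` parameter are all assumptions made for illustration, not the authors' module.

```python
def cross_layer_fusion(shallow, deep, alpha=0.5):
    """Blend an earlier layer's features into a deeper layer's features.

    Assumption: fusion is a fixed weighted sum. The real module may use
    concatenation, learned weights, or something else entirely.
    """
    if len(shallow) != len(deep):
        raise ValueError("feature vectors must have the same length")
    # alpha controls how much of the earlier (shallow) signal is kept.
    return [alpha * s + (1.0 - alpha) * d for s, d in zip(shallow, deep)]


fused = cross_layer_fusion([1.0, 2.0], [3.0, 4.0])
```

With the default `alpha=0.5` the result is the element-wise mean of the two inputs. In a trained network such a weight would more plausibly be learned than fixed.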

📝 Abstract

In the hyperspectral image (HSI) classification task, each pixel is assigned to a specific land-cover category or material. Convolutional neural networks (CNNs) and transformers have been widely used to extract local and non-local features in HSI classification. Recent works have used a multi-scale vision transformer (ViT) to enhance spectral feature capture, with promising results. However, most existing methods still struggle to use spatial and spectral information jointly and to preserve information across layers during propagation. To address these issues, we propose a synergistic CNN-Transformer network with pooling attention fusion for HSI classification, which collaboratively uses CNNs and ViT to process spatial and spectral features separately. Specifically, we propose a Twin-Branch Feature Extraction (TBFE) module, which employs 3D and 2D convolution in parallel to comprehensively extract spectral and spatial features from HSI. A hybrid pooling attention (HPA) module is designed to aggregate spatial attention. Moreover, a cascade transformer encoder is employed for global spectral feature extraction, and a simple yet efficient cross-layer feature fusion (CFF) module is designed to reduce the loss of crucial information from earlier network layers. Extensive experiments on several representative datasets demonstrate the superior performance of our proposed method compared with state-of-the-art methods. Code is available at https://github.com/chenpeng052/SCT-Net.git.
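To make "hybrid pooling attention" more concrete, here is a minimal, dependency-free sketch of one common way to build spatial attention from pooled features. At each pixel it pools across channels with both average and max pooling, passes the sum through a sigmoid, and uses the result to reweight the input. The abstract does not specify the HPA design, so the pooling axis, the way the two poolings are combined, and the name `hybrid_pooling_attention` are assumptions, not the authors' implementation.

```python
import math


def hybrid_pooling_attention(cube):
    """Reweight a feature cube with a pooled spatial attention map.

    cube: list of C channel maps, each an H x W nested list of floats.
    Assumption: "hybrid" means average pooling plus max pooling over channels.
    """
    channels = len(cube)
    height, width = len(cube[0]), len(cube[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for i in range(height):
        for j in range(width):
            values = [cube[c][i][j] for c in range(channels)]
            avg_pool = sum(values) / channels  # average over channels
            max_pool = max(values)             # maximum over channels
            # Sigmoid squashes the combined score to a weight in (0, 1).
            weight = 1.0 / (1.0 + math.exp(-(avg_pool + max_pool)))
            for c in range(channels):
                out[c][i][j] = cube[c][i][j] * weight
    return out


# Two channels, one row, two pixels.
attended = hybrid_pooling_attention([[[0.0, 1.0]], [[0.0, 3.0]]])
```

In this toy input the second pixel has larger responses, so it receives a weight close to 1 and is passed through almost unchanged. A practical version would operate on tensors and typically add a learned convolution before the sigmoid.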

Problem

Research questions and friction points this paper is trying to address.

hyperspectral image classification

spatial-spectral information

feature preservation

CNN-Transformer synergy

Innovation

Methods, ideas, or system contributions that make the work stand out.

CNN-Transformer synergy

pooling attention fusion

twin-branch feature extraction

hyperspectral image classification