🤖 AI Summary
This work addresses the challenges of hyperspectral image classification under extremely limited labeled samples and high-dimensional spectral data, compounded by the quadratic computational complexity of conventional Transformers. To this end, the authors propose VP-Hype, a hybrid architecture that integrates State Space Models (Mamba) with Transformers to capture long-range dependencies at reduced cost, built on a 3D-CNN spectral front-end and augmented by a dual-modal visual–textual prompting mechanism that provides context-aware guidance under label scarcity. Evaluated with only 2% of samples used for training, the method achieves overall accuracies of 99.69% on Salinas and 99.45% on Longkou, outperforming existing approaches and establishing a new state of the art for data-scarce hyperspectral classification.
📝 Abstract
Accurate classification of hyperspectral imagery (HSI) is often frustrated by the tension between high-dimensional spectral data and the extreme scarcity of labeled training samples. While hierarchical models like LoLA-SpecViT have demonstrated the power of local windowed attention and parameter-efficient fine-tuning, the quadratic complexity of standard Transformers remains a barrier to scaling. We introduce VP-Hype, a framework that rethinks HSI classification by unifying the linear-time efficiency of State-Space Models (SSMs) with the relational modeling of Transformers in a novel hybrid architecture. Building on a robust 3D-CNN spectral front-end, VP-Hype replaces conventional attention blocks with a hybrid Mamba–Transformer backbone to capture long-range dependencies with significantly reduced computational overhead. Furthermore, we address the label-scarcity problem by integrating dual-modal visual and textual prompts that provide context-aware guidance for feature extraction. Our experimental evaluation demonstrates that VP-Hype establishes a new state of the art in low-data regimes. Specifically, with only 2% of samples used for training, the model achieves an Overall Accuracy (OA) of 99.69% on the Salinas dataset and 99.45% on the Longkou dataset. These results suggest that the convergence of hybrid sequence modeling and multi-modal prompting provides a robust path forward for high-performance, sample-efficient remote sensing.
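The abstract's efficiency argument rests on the contrast between an SSM's recurrent scan and self-attention's pairwise scoring. The toy sketch below is not the paper's implementation; it uses a hypothetical scalar, diagonal, already-discretized SSM (parameters `a`, `b`, `c` are illustrative constants, not learned Mamba parameters) purely to show why one pass over a length-T sequence costs O(T) while attention's score matrix costs O(T²):

```python
# Illustrative sketch only: a scalar discretized state-space recurrence,
#     h_t = a * h_{t-1} + b * x_t   (state update)
#     y_t = c * h_t                 (readout)
# Each step does O(1) work, so the full sequence is O(T) time and O(1)
# state, versus the O(T^2) pairwise scores of standard self-attention.

def ssm_scan(x, a=0.9, b=0.5, c=1.0):
    """Run a scalar SSM over sequence x; return outputs y_1..y_T."""
    h = 0.0
    ys = []
    for x_t in x:            # single pass over the sequence: O(T)
        h = a * h + b * x_t  # recurrent state update
        ys.append(c * h)     # readout
    return ys

def attention_scores(x):
    """Toy self-attention score matrix: O(T^2) pairwise products."""
    return [[xi * xj for xj in x] for xi in x]

seq = [1.0, 0.0, 0.0, 0.0]
print(ssm_scan(seq))              # impulse response decays geometrically
print(len(attention_scores(seq)))  # T rows, each with T scores
```

Doubling the sequence length doubles the scan's work but quadruples the score matrix, which is the scaling gap the hybrid backbone is designed to exploit.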