🤖 AI Summary
To address the redundancy in token representations caused by global self-attention, insufficient multi-scale modeling, and the difficulty of preserving high-frequency details (e.g., edges and textures) in Transformer-based hyperspectral pansharpening, this paper proposes the Token-wise High-frequency Augmentation Transformer (THAT), a framework that combines high-frequency enhancement with critical token selection. Methodologically, it introduces: (1) Pivotal Token Selective Attention (PTSA), which suppresses attention dispersion over redundant tokens; (2) a Multi-level Variance-aware Feed-forward Network (MVFN), which explicitly encodes spectral–spatial priors and strengthens high-frequency response; and (3) a token-level high-frequency enhancement strategy that jointly optimizes spectral fidelity and spatial detail recovery. On standard benchmarks, the method achieves state-of-the-art reconstruction quality with reduced computational overhead, improving both spatial detail restoration and spectral consistency.
📝 Abstract
Transformer-based methods have demonstrated strong potential in hyperspectral pansharpening by modeling long-range dependencies. However, their effectiveness is often limited by redundant token representations and a lack of multi-scale feature modeling. Hyperspectral images exhibit intrinsic spectral priors (e.g., abundance sparsity) and spatial priors (e.g., non-local similarity), which are critical for accurate reconstruction. From a spectral-spatial perspective, Vision Transformers (ViTs) face two major limitations: they struggle to preserve high-frequency components, such as material edges and texture transitions, and suffer from attention dispersion across redundant tokens. These issues stem from the global self-attention mechanism, which tends to dilute high-frequency signals and overlook localized details. To address these challenges, we propose the Token-wise High-frequency Augmentation Transformer (THAT), a novel framework designed to enhance hyperspectral pansharpening through improved high-frequency feature representation and token selection. Specifically, THAT introduces: (1) Pivotal Token Selective Attention (PTSA) to prioritize informative tokens and suppress redundancy; (2) a Multi-level Variance-aware Feed-forward Network (MVFN) to enhance high-frequency detail learning. Experiments on standard benchmarks show that THAT achieves state-of-the-art performance with improved reconstruction quality and efficiency. The source code is available at https://github.com/kailuo93/THAT.
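The abstract names the two modules without detailing them; as a rough illustration of the general ideas behind them, the NumPy sketch below shows top-k token selection inside self-attention (one plausible reading of PTSA: attention mass is concentrated on a few pivotal tokens rather than dispersed over all of them) and a feed-forward layer whose residual update is gated by per-token feature variance, a crude proxy for high-frequency content (a loose reading of "variance-aware"). All function names, the identity Q/K/V projections, and the gating form are hypothetical simplifications, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax; -inf entries receive zero weight."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pivotal_token_attention(x, k):
    """Toy single-head self-attention that, for each query, keeps only
    the top-k attention scores and masks the rest to -inf, so each
    token attends to a few 'pivotal' tokens instead of all of them.
    x: (num_tokens, dim). Identity Q/K/V projections for brevity
    (hypothetical; not the paper's PTSA)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                      # (N, N) similarities
    kth = np.sort(scores, axis=-1)[:, -k][:, None]     # k-th largest per row
    masked = np.where(scores >= kth, scores, -np.inf)  # drop non-pivotal tokens
    attn = softmax(masked, axis=-1)                    # each row has k nonzeros
    return attn @ x, attn

def variance_gated_ffn(x, w1, w2):
    """Toy feed-forward layer whose residual update is scaled by each
    token's feature variance, used here as a rough stand-in for
    high-frequency content (hypothetical reading of 'variance-aware')."""
    var = x.var(axis=-1, keepdims=True)  # per-token feature variance
    gate = var / (var + 1.0)             # squash variance into (0, 1)
    h = np.maximum(x @ w1, 0.0)          # ReLU hidden layer
    return x + gate * (h @ w2)           # variance-scaled residual update
```

Selecting a fixed top-k per query is only one way to sparsify attention; the point of the sketch is that masking low-score tokens before the softmax prevents attention mass from being spread thinly over redundant tokens.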