🤖 AI Summary
To address data scarcity, high training costs, and the difficulty of joint spatial-spectral modeling under few-shot conditions in hyperspectral target tracking, this paper proposes a lightweight and efficient framework for adapting large pretrained transformer-based foundation models to snapshot hyperspectral tracking. Methodologically, it introduces: (1) an adaptive, learnable spatial-spectral token fusion module that explicitly models cross-dimensional feature interactions and can be attached to any transformer-based backbone; (2) a cross-modality training pipeline that enables effective learning across hyperspectral datasets collected with different sensor modalities, extracting complementary knowledge from additional modalities whether or not they are present at test time; and (3) a training recipe that reaches strong performance with only minimal training iterations. Under scarce annotation settings, the framework improves tracking accuracy and the robustness of spectral discrimination on benchmark hyperspectral tracking datasets.
📝 Abstract
Hyperspectral object tracking using snapshot mosaic cameras is emerging because it provides enhanced spectral information alongside spatial data, contributing to a more comprehensive understanding of material properties. Transformers have consistently outperformed convolutional neural networks (CNNs) at learning feature representations and would therefore be expected to be effective for hyperspectral object tracking. However, training large transformers requires extensive datasets and prolonged training periods. This is particularly critical for complex tasks like object tracking, and the scarcity of large datasets in the hyperspectral domain is a bottleneck to realizing the full potential of powerful transformer models. This paper proposes an effective methodology that adapts large pretrained transformer-based foundation models for hyperspectral object tracking. We propose an adaptive, learnable spatial-spectral token fusion module that can be extended to any transformer-based backbone for learning the inherent spatial-spectral features in hyperspectral data. Furthermore, our model incorporates a cross-modality training pipeline that facilitates effective learning across hyperspectral datasets collected with different sensor modalities. This enables the extraction of complementary knowledge from additional modalities, whether or not they are present during testing. Our proposed model also achieves strong performance with minimal training iterations.
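The abstract does not specify how the spatial-spectral token fusion module is implemented. As a rough, hypothetical illustration of the general idea (not the paper's actual design), the sketch below lets spatial patch tokens attend over spectral band tokens via cross-attention, then blends the attended spectral context back into the spatial stream through a learnable sigmoid gate. All names, dimensions, and the gating scheme are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SpatialSpectralFusion:
    """Hypothetical sketch of a learnable spatial-spectral token fusion:
    spatial tokens query spectral tokens (cross-attention), and a
    learnable per-channel gate controls how much spectral context is
    mixed back into each spatial token. Not the paper's actual module."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(dim)
        self.Wq = rng.normal(0.0, s, (dim, dim))  # query projection (spatial)
        self.Wk = rng.normal(0.0, s, (dim, dim))  # key projection (spectral)
        self.Wv = rng.normal(0.0, s, (dim, dim))  # value projection (spectral)
        self.gate = np.zeros(dim)                 # learnable; sigmoid(0) = 0.5 at init

    def __call__(self, spatial, spectral):
        # spatial: (N_s, dim) patch tokens; spectral: (N_b, dim) band tokens
        q = spatial @ self.Wq
        k = spectral @ self.Wk
        v = spectral @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N_s, N_b)
        cross = attn @ v                          # spectral context per spatial token
        g = 1.0 / (1.0 + np.exp(-self.gate))      # sigmoid gate in (0, 1)
        return spatial + g * cross                # gated residual fusion

# Example usage with arbitrary token counts and embedding size
fuse = SpatialSpectralFusion(dim=8)
spatial_tokens = np.random.default_rng(1).normal(size=(5, 8))
spectral_tokens = np.random.default_rng(2).normal(size=(3, 8))
fused = fuse(spatial_tokens, spectral_tokens)    # shape (5, 8), same as spatial
```

Because the fused output keeps the spatial token shape, a module like this could in principle be inserted between existing transformer blocks without altering the backbone, which is one way the "extendable to any transformer-based backbone" property could be achieved.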