🤖 AI Summary
Existing spectral foundation models are constrained by RGB-centric pretraining paradigms, limiting their adaptability to hundred-channel hyperspectral data and confining them primarily to remote sensing applications. This work introduces the first general-purpose spectral foundation model, unifying support for proximal and remote sensing modalities as well as multispectral and hyperspectral imaging (>100 bands). Methodologically, it integrates spectral channel encoding (a spectral-specific positional encoding), spatial-spectral joint masking, and RGB/ImageNet transfer strategies within a masked autoencoder framework. Evaluated across six downstream tasks, the model achieves an average performance gain of 12.7%, attains 92% of fully supervised accuracy using only 1% of the labeled data, and, critically, demonstrates for the first time cross-modal generalization across imaging distance (proximal vs. remote) and spectral dimensionality (multispectral to hyperspectral).
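The summary mentions "spectral channel encoding", i.e. a positional encoding tied to the spectral channel rather than only spatial location. As a minimal sketch of one plausible form, the snippet below keys a standard sinusoidal encoding to each band's center wavelength, so a channel is identified by its physical wavelength instead of its index; the function name, embedding dimension, and wavelength range are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def wavelength_encoding(wavelengths_nm, dim=32):
    """Hypothetical spectral channel encoding: sinusoidal features keyed to
    each band's center wavelength (nm), so two cameras whose bands sample
    the same wavelength get the same code regardless of channel ordering."""
    w = np.asarray(wavelengths_nm, dtype=float)[:, None]          # (C, 1)
    freqs = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))   # (dim/2,)
    angles = w * freqs                                            # (C, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)  # (C, dim)

# 120 bands evenly spaced over the visible/NIR range 400-1000 nm (assumed)
enc = wavelength_encoding(np.linspace(400, 1000, 120))
print(enc.shape)  # (120, 32)
```

Because the encoding depends on wavelength rather than band count, the same model could in principle ingest a 10-band multispectral image and a 200-band hyperspectral cube with a shared embedding space.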
📝 Abstract
Spectral imaging data acquired with multispectral and hyperspectral cameras can have hundreds of channels, where each channel records the reflectance at a specific wavelength and bandwidth. Time and resource constraints limit our ability to collect large spectral datasets, making it difficult to build and train predictive models from scratch. In the RGB domain, we can often alleviate some of the limitations of smaller datasets by starting from pretrained foundation models. However, most existing foundation models are pretrained on large datasets of 3-channel RGB images, severely limiting their effectiveness on spectral imaging data. The few spectral foundation models that do exist usually suffer from one of two limitations: (1) they are built and trained only on remote sensing data, limiting their application to proximal spectral imaging; or (2) they rely on the more widely available multispectral datasets with fewer than 15 channels, restricting their use with hundred-channel hyperspectral images. To address these issues, we propose a large-scale foundation model and dataset built on the masked autoencoder architecture that leverages spectral channel encoding, spatial-spectral masking, and ImageNet pretraining to yield an adaptable and robust model for downstream spectral imaging tasks.
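To make the spatial-spectral masking idea concrete, here is a minimal NumPy sketch: a hyperspectral cube is tokenized into patches that span both a spatial window and a group of adjacent bands, and a random fraction of those tokens is hidden, MAE-style, for the encoder to reconstruct. The patch size, band grouping, and 75% mask ratio are illustrative assumptions, not the settings used in this work.

```python
import numpy as np

def spatial_spectral_mask(cube, patch=4, band_group=10, mask_ratio=0.75, seed=0):
    """Tokenize an (H, W, C) cube into spatial-spectral patches and randomly
    mask a fraction of them (illustrative sketch, not the paper's code).
    Assumes H, W divisible by `patch` and C divisible by `band_group`.
    Returns the visible tokens, the boolean mask, and the token grid shape."""
    H, W, C = cube.shape
    gh, gw, gc = H // patch, W // patch, C // band_group
    # Each token covers a patch x patch spatial window and band_group bands.
    tokens = (cube.reshape(gh, patch, gw, patch, gc, band_group)
                  .transpose(0, 2, 4, 1, 3, 5)
                  .reshape(gh * gw * gc, patch * patch * band_group))
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, int(n * mask_ratio), replace=False)] = True  # True = hidden
    return tokens[~mask], mask, (gh, gw, gc)

# Toy 32x32 scene with 100 spectral bands
cube = np.random.rand(32, 32, 100)
vis, mask, grid = spatial_spectral_mask(cube)
print(grid, vis.shape)  # (8, 8, 10) (160, 160)
```

Masking along the spectral axis as well as the spatial axes forces the model to infer missing wavelengths from their neighbors, which is the kind of structure a spectral (rather than RGB) pretraining objective can exploit.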