🤖 AI Summary
To address scalability limitations, high computational cost, and inflexible modality- and channel-wise expansion in multimodal hyperspectral remote sensing image modeling, this paper proposes LESS ViT, a scalable foundation model. Methodologically, it introduces a low-rank efficient spatial-spectral attention block that approximates joint spatial-spectral attention via a Kronecker product; a continuous positional-channel embedding layer that preserves the physical continuity of spatial and spectral coordinates; and a perception field mask that captures local spatial dependencies, all while maintaining parameter efficiency. The model is pretrained with a hyperspectral masked autoencoder that combines positional and channel masking. Evaluated on the newly established benchmark GFM-Bench, LESS ViT achieves state-of-the-art accuracy while reducing parameters by 32% and FLOPs by 41%. Moreover, it supports flexible, plug-and-play modality expansion, enabling integration of new spectral or auxiliary modalities without architectural redesign.
📝 Abstract
Geospatial raster (imagery) data, such as that collected by satellite-based imaging systems at different times and spectral bands, hold immense potential for enabling a wide range of high-impact applications. This potential stems from the rich information that is spatially and temporally contextualized across multiple channels and sensing modalities. Recent work has adapted existing self-supervised learning approaches for such geospatial data. However, these approaches lack scalable model architectures, leading to inflexibility and computational inefficiency as the number of channels and modalities grows. To address these limitations, we introduce the Low-rank Efficient Spatial-Spectral Vision Transformer (LESS ViT) with three key innovations: i) the LESS Attention Block, which approximates high-dimensional spatial-spectral attention through the Kronecker product of low-dimensional spatial and spectral attention components; ii) the Continuous Positional-Channel Embedding Layer, which preserves both the spatial and spectral continuity and the physical characteristics of each patch; and iii) the Perception Field Mask, which exploits local spatial dependencies by constraining attention to neighboring patches. To evaluate the proposed innovations, we construct GFM-Bench, a comprehensive benchmark for geospatial raster data. We pretrain LESS ViT using a Hyperspectral Masked Autoencoder framework with integrated positional and channel masking strategies. Experimental results demonstrate that our proposed method surpasses current state-of-the-art multi-modal geospatial foundation models, achieving superior performance with less computation and fewer parameters. The flexibility and extensibility of our framework make it a promising direction for future geospatial data analysis tasks that involve a wide range of modalities and channels.
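To make the core idea concrete, the sketch below illustrates how a joint spatial-spectral attention map over P patches and C channels can be approximated as the Kronecker product of a small P×P spatial map and a small C×C spectral map, with an optional "perception field" mask restricting spatial attention to neighboring patches. This is a minimal illustration under assumed simplifications (channel/patch mean-pooling for queries and keys, shared random projection weights, a single head), not the paper's actual layer design:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def less_attention(x, mask=None, d=8, seed=0):
    """Toy Kronecker-factored spatial-spectral attention.

    x:    (P, C, D) tokens -- P spatial patches, C spectral channels.
    mask: optional (P, P) boolean perception-field mask; True keeps a
          patch pair, False blocks it before the spatial softmax.
    Returns attended tokens and the joint (P*C, P*C) attention map.
    """
    P, C, D = x.shape
    rng = np.random.default_rng(seed)
    Wq, Wk = rng.normal(size=(2, D, d)) / np.sqrt(D)  # illustrative random weights

    # Spatial attention (P x P): pool over channels, attend over patches.
    xs = x.mean(axis=1)
    logits = (xs @ Wq) @ (xs @ Wk).T / np.sqrt(d)
    if mask is not None:
        logits = np.where(mask, logits, -np.inf)  # exclude non-neighbors
    A_s = softmax(logits)

    # Spectral attention (C x C): pool over patches, attend over channels.
    xc = x.mean(axis=0)
    A_c = softmax((xc @ Wq) @ (xc @ Wk).T / np.sqrt(d))

    # Kronecker product yields a (P*C) x (P*C) joint attention map from
    # two small factors, instead of attending over P*C tokens directly.
    A = np.kron(A_s, A_c)
    return (A @ x.reshape(P * C, D)).reshape(P, C, D), A
```

Because both factors are row-stochastic, their Kronecker product is also a valid (row-stochastic) attention map, which is what makes the low-rank factorization a drop-in approximation of full joint attention.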