SpectralEarth-FM: Bringing Hyperspectral Imagery into Multimodal Earth Observation Pretraining

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing foundation models for Earth observation rarely incorporate hyperspectral imagery (HSI), while specialized HSI models are pretrained on HSI alone, without co-located multimodal remote sensing data. This work proposes SpectralEarth-FM, a hierarchical Transformer architecture that enables unified pretraining of HSI alongside multispectral and SAR data, a combination the authors describe as previously unexplored. It does so through spectral tokenization, sensor-specific encoders, a cross-sensor fusion module, and a shared hierarchical encoder. The authors also introduce SpectralEarth-MM, a large-scale co-located multimodal dataset covering approximately 2M locations and 25M patches. Pretrained with a JEPA-style joint-embedding prediction objective, the model achieves state-of-the-art results on both hyperspectral downstream tasks and general Earth observation benchmarks evaluated under the PANGAEA protocol.
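To make the idea of spectral tokenization concrete, here is a minimal sketch in plain Python. It is not the paper's tokenizer, whose details the summary does not give. It assumes one simple scheme: contiguous bands are grouped into fixed-width tokens, with the last group zero-padded. The function name `spectral_tokenize` and the parameter `group_size` are hypothetical.

```python
from typing import List


def spectral_tokenize(spectrum: List[float], group_size: int) -> List[List[float]]:
    """Split one pixel's spectrum into contiguous band groups (tokens).

    Hypothetical illustration only: the final group is zero-padded so that
    every token has the same width, regardless of the sensor's band count.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    tokens = []
    for start in range(0, len(spectrum), group_size):
        group = list(spectrum[start:start + group_size])
        group += [0.0] * (group_size - len(group))  # pad the final group
        tokens.append(group)
    return tokens


# A toy 10-band "hyperspectral" pixel, grouped 4 bands at a time.
hsi_pixel = [float(b) for b in range(10)]
hsi_tokens = spectral_tokenize(hsi_pixel, group_size=4)

# A 4-band "multispectral" pixel fits in a single token of the same width.
msi_tokens = spectral_tokenize([0.1, 0.2, 0.3, 0.4], group_size=4)
```

Under this assumption, sensors with very different band counts yield token sequences of different lengths but identical token width, which is one way a shared encoder could consume them jointly.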

📝 Abstract

Earth observation (EO) foundation models (FMs) are increasingly trained on multisensor data, spanning multispectral imagery (MSI), synthetic aperture radar (SAR), and derived geospatial layers, but hyperspectral imagery (HSI) remains underrepresented. Conversely, existing hyperspectral FMs are trained on HSI alone, leaving joint pretraining and fusion of HSI with co-located EO sensors unexplored. We introduce SpectralEarth-FM, a hierarchical transformer for multisensor EO input with heterogeneous spectral dimensionality. The architecture combines spectral tokenization for hyperspectral inputs, sensor-specific encoders, a cross-sensor fusion module, and a shared hierarchical encoder, enabling joint processing of HSI and lower-channel observations. To pretrain SpectralEarth-FM, we curate SpectralEarth-MM, a dataset that co-locates HSI from three spaceborne sensors (EnMAP, EMIT, DESIS) with Sentinel-2, Landsat-8/9 optical imagery, Landsat land surface temperature (LST), and Sentinel-1 SAR, over common geographic footprints. It comprises approximately 2M globally distributed locations, 25M georeferenced patches, and over 40TB of data. Pretraining uses a Joint-Embedding Predictive Architecture (JEPA)-style objective that matches representations between global views and single-sensor local views from the same location. We evaluate SpectralEarth-FM on hyperspectral downstream tasks and standard EO benchmarks following the PANGAEA protocol, achieving state-of-the-art results across both evaluation settings.
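The JEPA-style objective described above matches representations of single-sensor local views to those of global views from the same location. The sketch below illustrates that kind of embedding-matching loss in plain Python. The abstract does not state the exact distance, normalization, or predictor, so the normalized squared L2 distance used here is an assumption, and the names `l2_normalize` and `jepa_style_loss` are hypothetical. In a real training loop the target embeddings would typically be treated as constants (no gradient), which plain Python cannot express.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def jepa_style_loss(
    predicted: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
) -> float:
    """Mean squared L2 distance between normalized embedding pairs.

    predicted: embeddings predicted from single-sensor local views.
    target:    embeddings of the matching global views (assumed fixed).
    The choice of distance is an assumption made for illustration.
    """
    if len(predicted) != len(target) or not predicted:
        raise ValueError("need equally many, non-empty predicted/target sets")
    total = 0.0
    for p, t in zip(predicted, target):
        p_hat, t_hat = l2_normalize(p), l2_normalize(t)
        total += sum((a - b) ** 2 for a, b in zip(p_hat, t_hat))
    return total / len(predicted)


# Embeddings pointing in the same direction incur no loss ...
aligned = jepa_style_loss([[1.0, 0.0]], [[2.0, 0.0]])
# ... while orthogonal embeddings are penalized.
orthogonal = jepa_style_loss([[1.0, 0.0]], [[0.0, 1.0]])
```

Because both embeddings are normalized first, the loss depends only on their direction, so it is zero for aligned pairs and grows as the local-view prediction drifts away from the global-view target.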

Problem

Research questions and friction points this paper is trying to address.

hyperspectral imagery

foundation models

multimodal Earth observation

sensor fusion

pretraining

Innovation

Methods, ideas, or system contributions that make the work stand out.

hyperspectral imagery

multimodal fusion

foundation model