SpectralEarth-FM: Bringing Hyperspectral Imagery into Multimodal Earth Observation Pretraining

📅 2026-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Earth observation foundation models rarely incorporate hyperspectral imagery (HSI), while existing hyperspectral models are pretrained on HSI alone and do not fuse it with co-located sensors. This work proposes SpectralEarth-FM, a hierarchical transformer that pretrains HSI jointly with multispectral and SAR data by combining spectral tokenization, sensor-specific encoders, a cross-sensor fusion module, and a shared hierarchical encoder. The authors also introduce SpectralEarth-MM, a co-located multimodal dataset of approximately 2M locations, 25M georeferenced patches, and over 40TB of data. Pretrained with a JEPA-style joint-embedding prediction objective, the model reports state-of-the-art results on hyperspectral downstream tasks and on standard Earth observation benchmarks evaluated under the PANGAEA protocol.
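To make the idea of spectral tokenization concrete, here is a minimal sketch of one way a spectrum with an arbitrary number of bands could be turned into fixed-width tokens. This is an illustration under stated assumptions, not the paper's tokenizer: the function name `spectral_tokens`, the contiguous band grouping, the zero padding, and the toy band counts are all hypothetical.

```python
from typing import List


def spectral_tokens(
    spectrum: List[float], group_size: int, pad_value: float = 0.0
) -> List[List[float]]:
    """Split one pixel's spectrum into contiguous band groups of equal width.

    Assumption: each group of `group_size` adjacent bands becomes one token,
    and the last group is padded so every token has the same width.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    tokens = []
    for start in range(0, len(spectrum), group_size):
        group = list(spectrum[start:start + group_size])
        # Pad the final group when the band count is not a multiple of group_size.
        group += [pad_value] * (group_size - len(group))
        tokens.append(group)
    return tokens


if __name__ == "__main__":
    # Toy inputs: a 10-band "hyperspectral" pixel and a 4-band "multispectral" pixel.
    many_bands = [float(b) for b in range(10)]
    few_bands = [0.1, 0.2, 0.3, 0.4]
    print(len(spectral_tokens(many_bands, 4)))  # 3 tokens
    print(len(spectral_tokens(few_bands, 4)))   # 1 token
```

The point of the sketch is only that inputs with different spectral dimensionality yield different numbers of equally sized tokens, which a shared encoder can then consume.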
📝 Abstract
Earth observation (EO) foundation models (FMs) are increasingly trained on multisensor data, spanning multispectral imagery (MSI), synthetic aperture radar (SAR), and derived geospatial layers, but hyperspectral imagery (HSI) remains underrepresented. Conversely, existing hyperspectral FMs are trained on HSI alone, leaving joint pretraining and fusion of HSI with co-located EO sensors unexplored. We introduce SpectralEarth-FM, a hierarchical transformer for multisensor EO input with heterogeneous spectral dimensionality. The architecture combines spectral tokenization for hyperspectral inputs, sensor-specific encoders, a cross-sensor fusion module, and a shared hierarchical encoder, enabling joint processing of HSI and lower-channel observations. To pretrain SpectralEarth-FM, we curate SpectralEarth-MM, a dataset that co-locates HSI from three spaceborne sensors (EnMAP, EMIT, DESIS) with Sentinel-2, Landsat-8/9 optical imagery, Landsat land surface temperature (LST), and Sentinel-1 SAR, over common geographic footprints. It comprises approximately 2M globally distributed locations, 25M georeferenced patches, and over 40TB of data. Pretraining uses a Joint-Embedding Predictive Architecture (JEPA)-style objective that matches representations between global views and single-sensor local views from the same location. We evaluate SpectralEarth-FM on hyperspectral downstream tasks and standard EO benchmarks following the PANGAEA protocol, achieving state-of-the-art results across both evaluation settings.
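The pretraining objective described above matches representations between a global view and single-sensor local views of the same location. The sketch below shows one plausible form of such a matching loss. It is an assumption-laden illustration: the abstract does not specify the distance function, so cosine distance is swapped in here, and the names `embedding_matching_loss` and `l2_normalize`, the sensor keys, and the toy vectors are hypothetical. Details such as the predictor network or how the target branch is updated are not modeled.

```python
import math
from typing import Dict, List


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def embedding_matching_loss(
    global_target: List[float], local_predictions: Dict[str, List[float]]
) -> float:
    """Average cosine distance between a global-view target embedding and
    the embeddings predicted from each single-sensor local view.

    Assumption: cosine distance stands in for whatever distance the
    actual objective uses.
    """
    if not local_predictions:
        raise ValueError("need at least one local view")
    target = l2_normalize(global_target)
    losses = []
    for sensor, prediction in local_predictions.items():
        if len(prediction) != len(target):
            raise ValueError(f"dimension mismatch for sensor {sensor!r}")
        unit_prediction = l2_normalize(prediction)
        cosine = sum(t * p for t, p in zip(target, unit_prediction))
        losses.append(1.0 - cosine)
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Toy 2-D embeddings: the "hsi" view agrees with the target, the "sar" view is orthogonal.
    loss = embedding_matching_loss([1.0, 0.0], {"hsi": [2.0, 0.0], "sar": [0.0, 3.0]})
    print(loss)  # 0.5
```

In this toy case, a local view whose embedding points in the same direction as the global target contributes zero loss and an orthogonal one contributes one, so the average is 0.5.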
Problem

Research questions and friction points this paper is trying to address.

hyperspectral imagery
foundation models
multimodal Earth observation
sensor fusion
pretraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

hyperspectral imagery
multimodal fusion
foundation model
spectral tokenization
cross-sensor pretraining
🔎 Similar Papers
2024-08-15 | IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing | Citations: 18
Nassim Ait Ali Braham
Chair of Data Science in Earth Observation, Technical University of Munich, Germany; Remote Sensing Technology Institute, German Aerospace Center (DLR), Germany
Aaron Banze
Remote Sensing Technology Institute, German Aerospace Center (DLR), Germany; Department of Aerospace Engineering, University of the Bundeswehr Munich, Germany
Conrad M. Albrecht
Remote Sensing Technology Institute, German Aerospace Center (DLR), Germany; LEAP, Columbia University, USA
Julien Mairal
Inria - Univ. Grenoble Alpes
machine learning, artificial intelligence, optimization, computer vision, image processing
Jocelyn Chanussot
INRIA, on leave from Grenoble INP
artificial intelligence, image processing, signal processing, remote sensing, hyperspectral
Xiao Xiang Zhu
Technical University of Munich
Earth Observation, AI4EO, Signal Processing, Data Science, Synthetic Aperture Radar