TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation

📅 2025-06-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current deep learning models for Earth observation often generalize poorly because their training data lack scale, geographic coverage, and spectral diversity. To address this, the paper proposes TerraFM, a foundation model for multi-sensor remote sensing that jointly models Sentinel-1 (SAR) and Sentinel-2 (optical) imagery. The method introduces a modality-aware self-supervised framework featuring land-cover-guided sampling, modality-specific patch embeddings, adaptive cross-modal attention fusion, and local-global contrastive learning, together with a dual-centering mechanism whose class-frequency-aware regularization mitigates long-tailed label distributions. Evaluated on GEO-Bench and Copernicus-Bench, TerraFM achieves state-of-the-art performance across classification and segmentation tasks and markedly improves zero-shot and few-shot transfer, enhancing the global transferability and geographic robustness of land-cover representations for more reliable, scalable Earth observation analysis.

📝 Abstract
Modern Earth observation (EO) increasingly leverages deep learning to harness the scale and diversity of satellite imagery across sensors and regions. While recent foundation models have demonstrated promising generalization across EO tasks, many remain limited by the scale, geographical coverage, and spectral diversity of their training data, factors critical for learning globally transferable representations. In this work, we introduce TerraFM, a scalable self-supervised learning model that leverages globally distributed Sentinel-1 and Sentinel-2 imagery, combined with large spatial tiles and land-cover-aware sampling to enrich spatial and semantic coverage. By treating sensing modalities as natural augmentations in our self-supervised approach, we unify radar and optical inputs via modality-specific patch embeddings and adaptive cross-attention fusion. Our training strategy integrates local-global contrastive learning and introduces a dual-centering mechanism that incorporates class-frequency-aware regularization to address long-tailed distributions in land cover. TerraFM achieves strong generalization on both classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench. Our code and pretrained models are publicly available at https://github.com/mbzuai-oryx/TerraFM.
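The fusion idea in the abstract, separate patch embeddings per modality plus cross-attention between the resulting token streams, can be illustrated with a minimal NumPy sketch. This is not the authors' code: the band counts, patch size, embedding dimension, and residual fusion below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, patch=4):
    # img: (C, H, W) -> (num_patches, C * patch * patch)
    c, h, w = img.shape
    rows, cols = h // patch, w // patch
    out = np.empty((rows * cols, c * patch * patch))
    idx = 0
    for i in range(rows):
        for j in range(cols):
            out[idx] = img[:, i*patch:(i+1)*patch, j*patch:(j+1)*patch].ravel()
            idx += 1
    return out

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

dim = 32
# Modality-specific patch embeddings: separate projections because
# Sentinel-2 optical carries more bands than Sentinel-1 SAR.
W_opt = rng.normal(size=(12 * 4 * 4, dim)) * 0.02   # 12 optical bands (assumed)
W_sar = rng.normal(size=(2 * 4 * 4, dim)) * 0.02    # 2 SAR polarizations (VV, VH)

opt_tokens = patchify(rng.normal(size=(12, 16, 16))) @ W_opt  # (16, dim)
sar_tokens = patchify(rng.normal(size=(2, 16, 16))) @ W_sar   # (16, dim)

# Cross-attention fusion: optical tokens query the SAR tokens,
# and the attended SAR values are added back residually.
Wq, Wk, Wv = (rng.normal(size=(dim, dim)) * 0.02 for _ in range(3))
q, k, v = opt_tokens @ Wq, sar_tokens @ Wk, sar_tokens @ Wv
attn = softmax(q @ k.T / np.sqrt(dim), axis=-1)   # (16, 16) attention map
fused = opt_tokens + attn @ v                     # (16, dim) fused tokens
```

Because each modality has its own projection, the two sensors can differ in band count while still meeting in a shared token space, which is what lets the model treat them as natural augmentations of the same scene.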
Problem

Research questions and friction points this paper is trying to address.

Develop a scalable foundation model for multisensor Earth observation
Address the limited scale and diversity of existing training data
Unify radar and optical inputs via self-supervised learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable self-supervised learning with global Sentinel data
Unified radar and optical inputs via adaptive fusion
Local-global contrastive learning with dual-centering mechanism
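The dual-centering idea above can be sketched as a toy example: a DINO-style running center removes batch bias from the teacher logits, while a second, class-frequency-aware offset (logit-adjustment style) keeps frequent land-cover prototypes from dominating the sharpened targets. The constants, prototype count, and frequency vector here are invented for illustration and do not reproduce the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
num_proto = 8

# First center: EMA of teacher logits (DINO-style).
center = np.zeros(num_proto)
# Second center: a fixed log-frequency offset that down-weights
# common land-cover prototypes (long-tailed class frequencies, assumed).
freq = np.array([0.40, 0.25, 0.15, 0.08, 0.05, 0.04, 0.02, 0.01])
freq_offset = np.log(freq)

def teacher_targets(logits, temp=0.07):
    # Dual centering: subtract both the running center and the
    # frequency offset before the sharpened softmax.
    return softmax((logits - center - freq_offset) / temp, axis=-1)

for _ in range(100):
    # Simulated teacher logits, biased toward frequent prototypes.
    logits = rng.normal(size=(64, num_proto)) + 3.0 * freq_offset
    t = teacher_targets(logits)                       # (64, num_proto) targets
    center = 0.9 * center + 0.1 * logits.mean(axis=0) # EMA center update
```

The EMA term tracks whatever bias the current batches carry, while the frequency term encodes a static prior about the long tail; combining both is what makes the targets usable for rare land-cover classes.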
Muhammad Sohail Danish
Mohamed bin Zayed University of Artificial Intelligence

Muhammad Akhtar Munir
Mohamed bin Zayed University of Artificial Intelligence, UAE
Deep Learning · Model Calibration · Domain Generalization · VLMs · Remote Sensing

Syed Roshaan Ali Shah
University College London

Muhammad Haris Khan
Mohamed bin Zayed University of Artificial Intelligence

Rao Muhammad Anwer
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Computer Vision · Object Recognition

Jorma Laaksonen
Aalto University

Fahad Shahbaz Khan
MBZUAI, Linköping University, Sweden
Computer Vision · Object Recognition · Generative AI · AI for Science

Salman Khan
Mohamed bin Zayed University of Artificial Intelligence, Australian National University