A Survey on Remote Sensing Foundation Models: From Vision to Multimodality

📅 2025-03-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Remote sensing foundation models hold significant promise for geospatial intelligent interpretation, yet their practical deployment remains hindered by challenges including difficulty in fusing heterogeneous multi-source data, scarcity of high-quality annotations, immature cross-modal alignment mechanisms, and prohibitive computational overhead. To address these issues, this work presents a systematic survey of vision- and multimodal-based remote sensing foundation models. It is the first to comprehensively analyze multimodal alignment, cross-modal transfer, and scalability challenges across optical, SAR, LiDAR, textual, and geospatial modalities. We propose a pragmatic model evolution roadmap and an open resource ecosystem, incorporating ViT architectures, CLIP-style contrastive learning, self-supervised pretraining, and remote sensing–specific data curation strategies. A unified evaluation framework is established, synthesizing over 120 models and datasets; all resources are open-sourced via GitHub, providing an authoritative benchmark and practical guidance for remote sensing large models.
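The summary above names CLIP-style contrastive learning as the alignment mechanism these models typically adopt. As an illustration only (function name, shapes, and temperature value are assumptions, not details from the paper), the symmetric InfoNCE objective behind such image–text alignment can be sketched as:

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: (N, d) arrays where row i of each is a matched pair.
    Illustrative sketch of CLIP-style alignment, not any specific model's code.
    """
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature            # (N, N) similarity matrix
    labels = np.arange(len(img))                  # matched pairs lie on the diagonal

    def cross_entropy(lg, y):
        lg = lg - lg.max(axis=1, keepdims=True)   # subtract max for numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image retrieval directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Minimizing this loss pulls each image embedding toward its paired caption embedding and pushes it away from the other captions in the batch, which is the alignment behavior the survey discusses for optical/SAR imagery paired with text.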

📝 Abstract
The rapid advancement of remote sensing foundation models, particularly vision and multimodal models, has significantly enhanced the capabilities of intelligent geospatial data interpretation. These models combine various data modalities, such as optical, radar, and LiDAR imagery, with textual and geographic information, enabling more comprehensive analysis and understanding of remote sensing data. The integration of multiple modalities allows for improved performance in tasks like object detection, land cover classification, and change detection, which are often challenged by the complex and heterogeneous nature of remote sensing data. However, despite these advancements, several challenges remain. The diversity of data types, the need for large-scale annotated datasets, and the complexity of multimodal fusion techniques pose significant obstacles to the effective deployment of these models. Moreover, the computational demands of training and fine-tuning multimodal models require significant resources, further complicating their practical application in remote sensing image interpretation tasks. This paper provides a comprehensive review of the state of the art in vision and multimodal foundation models for remote sensing, focusing on their architectures, training methods, datasets, and application scenarios. We discuss the key challenges these models face, such as data alignment, cross-modal transfer learning, and scalability, while also identifying emerging research directions aimed at overcoming these limitations. Our goal is to provide a clear understanding of the current landscape of remote sensing foundation models and inspire future research that can push the boundaries of what these models can achieve in real-world applications. The list of resources collected by the paper can be found at https://github.com/IRIP-BUAA/A-Review-for-remote-sensing-vision-language-models.
Problem

Research questions and friction points this paper is trying to address.

Enhancing geospatial data interpretation via multimodal remote sensing models
Addressing challenges in data diversity and multimodal fusion techniques
Reducing computational demands for practical remote sensing applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines optical, radar, and LiDAR imagery with textual and geographic information
Integrates multimodal fusion techniques for more comprehensive analysis
Reviews architectures, training methods, datasets, and application scenarios
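The multimodal fusion named above can take several forms; the simplest is early (channel-level) fusion of co-registered modalities, sketched below. All names, shapes, and the linear projection are illustrative assumptions for exposition, not the survey's method:

```python
import numpy as np

def early_fusion(optical, sar, w):
    """Fuse co-registered optical and SAR patches by channel stacking.

    optical: (H, W, 3) array of optical bands
    sar:     (H, W, 1) array of SAR backscatter
    w:       (4, d) projection matrix mapping stacked channels to d features
    Returns per-pixel fused features of shape (H, W, d).
    Illustrative early-fusion sketch; real models typically use learned encoders.
    """
    x = np.concatenate([optical, sar], axis=-1)  # (H, W, 4) stacked modalities
    return x @ w                                  # shared linear projection
```

Late fusion (separate per-modality encoders whose features are merged afterwards) and cross-attention fusion are the heavier alternatives the survey's fusion discussion covers; early fusion assumes the modalities are spatially aligned, which is itself one of the alignment challenges noted above.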
👥 Authors
Ziyue Huang (The Hong Kong University of Science and Technology)
Hongxi Yan, Qiqi Zhan, Shuai Yang, Chenkai Zhang, YiMing Lei, Zeming Liu (State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China; also with the Hangzhou Innovation Institute, Beihang University, Hangzhou 310051, China)
Mingming Zhang (Beihang University)
Qingjie Liu (Professor, School of Computer Science and Engineering, Beihang University)
Yunhong Wang (Professor, School of Computer Science and Engineering, Beihang University)