A Survey on Remote Sensing Foundation Models: From Vision to Multimodality

📅 2025-03-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Remote sensing foundation models hold significant promise for geospatial intelligent interpretation, yet their practical deployment remains hindered by challenges including difficulty in fusing heterogeneous multi-source data, scarcity of high-quality annotations, immature cross-modal alignment mechanisms, and prohibitive computational overhead. To address these issues, this work presents a systematic survey of vision- and multimodal-based remote sensing foundation models. It is the first to comprehensively analyze multimodal alignment, cross-modal transfer, and scalability challenges across optical, SAR, LiDAR, textual, and geospatial modalities. We propose a pragmatic model evolution roadmap and an open resource ecosystem, incorporating ViT architectures, CLIP-style contrastive learning, self-supervised pretraining, and remote sensing–specific data curation strategies. A unified evaluation framework is established, synthesizing over 120 models and datasets; all resources are open-sourced via GitHub, providing an authoritative benchmark and practical guidance for remote sensing large models.
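The summary above names CLIP-style contrastive learning as the alignment mechanism these models typically adopt. As an illustration only (function name, shapes, and temperature value are assumptions, not details from the paper), the symmetric InfoNCE objective behind such image–text alignment can be sketched as:

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: (N, d) arrays where row i of each is a matched pair.
    Illustrative sketch of CLIP-style alignment, not any specific model's code.
    """
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature            # (N, N) similarity matrix
    labels = np.arange(len(img))                  # matched pairs lie on the diagonal

    def cross_entropy(lg, y):
        lg = lg - lg.max(axis=1, keepdims=True)   # subtract max for numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image retrieval directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Minimizing this loss pulls each image embedding toward its paired caption embedding and pushes it away from the other captions in the batch, which is the alignment behavior the survey discusses for optical/SAR imagery paired with text.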

📝 Abstract
The rapid advancement of remote sensing foundation models, particularly vision and multimodal models, has significantly enhanced the capabilities of intelligent geospatial data interpretation. These models combine various data modalities, such as optical, radar, and LiDAR imagery, with textual and geographic information, enabling more comprehensive analysis and understanding of remote sensing data. The integration of multiple modalities allows for improved performance in tasks like object detection, land cover classification, and change detection, which are often challenged by the complex and heterogeneous nature of remote sensing data. However, despite these advancements, several challenges remain. The diversity of data types, the need for large-scale annotated datasets, and the complexity of multimodal fusion techniques pose significant obstacles to the effective deployment of these models. Moreover, the computational demands of training and fine-tuning multimodal models require significant resources, further complicating their practical application in remote sensing image interpretation tasks. This paper provides a comprehensive review of the state of the art in vision and multimodal foundation models for remote sensing, focusing on their architectures, training methods, datasets, and application scenarios. We discuss the key challenges these models face, such as data alignment, cross-modal transfer learning, and scalability, while also identifying emerging research directions aimed at overcoming these limitations. Our goal is to provide a clear understanding of the current landscape of remote sensing foundation models and inspire future research that can push the boundaries of what these models can achieve in real-world applications. The list of resources collected by the paper can be found at https://github.com/IRIP-BUAA/A-Review-for-remote-sensing-vision-language-models.
Problem

Research questions and friction points this paper is trying to address.

Enhancing geospatial data interpretation via multimodal remote sensing models
Addressing challenges in data diversity and multimodal fusion techniques
Reducing computational demands for practical remote sensing applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines optical, radar, and LiDAR imagery with textual and geographic information
Integrates multimodal fusion techniques for more comprehensive analysis
Reviews architectures, training methods, datasets, and application scenarios
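The multimodal fusion named above can take several forms; the simplest is early (channel-level) fusion of co-registered modalities, sketched below. All names, shapes, and the linear projection are illustrative assumptions for exposition, not the survey's method:

```python
import numpy as np

def early_fusion(optical, sar, w):
    """Fuse co-registered optical and SAR patches by channel stacking.

    optical: (H, W, 3) array of optical bands
    sar:     (H, W, 1) array of SAR backscatter
    w:       (4, d) projection matrix mapping stacked channels to d features
    Returns per-pixel fused features of shape (H, W, d).
    Illustrative early-fusion sketch; real models typically use learned encoders.
    """
    x = np.concatenate([optical, sar], axis=-1)  # (H, W, 4) stacked modalities
    return x @ w                                  # shared linear projection
```

Late fusion (separate per-modality encoders whose features are merged afterwards) and cross-attention fusion are the heavier alternatives the survey's fusion discussion covers; early fusion assumes the modalities are spatially aligned, which is itself one of the alignment challenges noted above.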
👥 Authors
Ziyue Huang (The Hong Kong University of Science and Technology)
Hongxi Yan, Qiqi Zhan, Shuai Yang, Chenkai Zhang, YiMing Lei, Zeming Liu (State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China; also with the Hangzhou Innovation Institute, Beihang University, Hangzhou 310051, China)
Mingming Zhang (Beihang University)
Qingjie Liu (Professor, School of Computer Science and Engineering, Beihang University)
Yunhong Wang (Professor, School of Computer Science and Engineering, Beihang University)