🤖 AI Summary
This survey addresses key challenges in remote sensing multimodal understanding—namely, complex land-cover semantics, abstract geospatial concepts, and scarce annotated data—by systematically reviewing advances in vision-language models (VLMs) for remote sensing. We propose a unified taxonomy categorizing VLM enhancements along three core components: encoder, aligner, and decoder. To rigorously assess progress, we introduce the first remote sensing–specific unified evaluation framework, empirically validating the critical role of cross-modal alignment in geographic reasoning and fine-grained image description. Our methodology integrates contrastive learning, instruction tuning, cross-modal adapters, and remote sensing–informed prior embedding, leveraging high-resolution and Sentinel imagery alongside heterogeneous geotextual data. Covering over 30 works, this survey releases benchmark datasets, open-source code, and an evaluation toolkit, significantly improving VLM interpretability and practical utility in real-world applications such as disaster response and land-use classification.
📝 Abstract
Recently, the remarkable success of ChatGPT has sparked a renewed wave of interest in artificial intelligence (AI), and advances in visual language models (VLMs) have pushed this enthusiasm to new heights. Unlike previous AI approaches, which generally formulated different tasks as discriminative models, VLMs frame tasks as generative models and align language with visual information, enabling them to handle more challenging problems. The remote sensing (RS) field, a highly practical domain, has also embraced this trend and introduced several VLM-based RS methods that have demonstrated promising performance and enormous potential. In this paper, we first review the fundamental theories related to VLMs, then summarize the datasets constructed for VLMs in remote sensing and the various tasks they address. Finally, we categorize the improvement methods into three main parts according to the core components of VLMs and provide a detailed introduction and comparison of these methods. A project associated with this review has been created at https://github.com/taolijie11111/VLMs-in-RS-review.