🤖 AI Summary
Remote sensing vision-language models are hindered by the scarcity of high-quality image-text pairs, while existing synthetic data suffer from low fidelity and poor generalization. To address this, we propose MpGI, a two-stage multi-perspective generation and integration framework: (1) generation of distinct, detailed captions from different perspectives using multimodal large language models and rule-guided relay generation; and (2) integration of these diverse descriptions into comprehensive captions using large language models. Leveraging MpGI, we construct HQRS-IT-210K, a high-quality dataset comprising about 210K remote sensing images and 1.3M diverse, accurate captions. Fine-tuning CLIP and CoCa on this data yields HQRS-CLIP and RS-CoCa, which achieve new state-of-the-art performance on cross-modal retrieval and image captioning. Remarkably, HQRS-CLIP surpasses the prior SOTA using only 4.2% of its training data volume, and RS-CoCa generates captions that match or exceed human annotations in quality.
📝 Abstract
The application of vision-language foundation models (VLFMs) to remote sensing (RS) imagery has garnered significant attention due to their superior capability in various downstream tasks. A key challenge lies in the scarcity of high-quality, large-scale, image-text paired training data. Recently, several works introduced extensive image-text datasets for RS and trained VLFMs on them. However, because of the rudimentary methods used to generate captions, the quality of these datasets is suboptimal: they require larger volumes of training data while yielding only modest performance improvements. In this paper, we propose a two-stage method named MpGI (Multi-Perspective Generation and Integration) for generating high-quality text captions for RS images. First, we generate distinct and detailed descriptions from different perspectives using Rule-MLLM (Multimodal Large Language Model) Relay Generation together with direct MLLM generation. Next, we use Large Language Models (LLMs) to integrate these diverse descriptions into comprehensive captions that capture details from multiple perspectives. Finally, we construct the HQRS-IT-210K dataset, comprising about 210,000 RS images and 1.3 million captions. We fine-tune two VLFMs on our dataset: CLIP, a discriminative model, and CoCa, an image-to-text generative model, yielding our proposed HQRS-CLIP and RS-CoCa. Experimental results demonstrate that HQRS-CLIP surpasses the previous SOTA RS CLIP model on various downstream tasks while using only 4.2% of its training data, and that RS-CoCa outperforms other advanced approaches across benchmark datasets, generating captions for RS images that rival or even exceed manual annotations. The dataset, pre-trained models, and code will be released at https://github.com/YiguoHe/HQRS-210K-and-HQRS-CLIP.
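The two-stage control flow of MpGI can be sketched in miniature. This is only an illustrative outline, not the paper's implementation: `mllm_caption`, `rule_mllm_relay`, and `llm_integrate` are hypothetical stand-ins for the actual MLLM/LLM calls, whose prompts and models the abstract does not specify.

```python
from typing import List

def mllm_caption(image_path: str, perspective: str) -> str:
    # Hypothetical stand-in for an MLLM call that describes the image
    # from one perspective (e.g. objects present, spatial layout).
    return f"[{perspective}] description of {image_path}"

def rule_mllm_relay(image_path: str) -> str:
    # Hypothetical stand-in for Rule-MLLM Relay Generation: rule-derived
    # attributes are produced first, then relayed to an MLLM to be
    # phrased as a natural-language description.
    return f"[rule-relay] description of {image_path}"

def llm_integrate(captions: List[str]) -> str:
    # Hypothetical stand-in for the stage-2 LLM that fuses the
    # per-perspective drafts into one comprehensive caption.
    return " ".join(captions)

def mpgi_caption(image_path: str, perspectives: List[str]) -> str:
    # Stage 1: multi-perspective generation (MLLM + rule-guided relay).
    drafts = [mllm_caption(image_path, p) for p in perspectives]
    drafts.append(rule_mllm_relay(image_path))
    # Stage 2: integration into a single comprehensive caption.
    return llm_integrate(drafts)
```

In a real pipeline each stub would wrap a model inference call; the point of the sketch is that integration happens once per image, over all perspective drafts, rather than caption-by-caption.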