🤖 AI Summary
To address the scarcity of large-scale, high-quality image-text datasets that hinders semantic understanding of Synthetic Aperture Radar (SAR) imagery, this work introduces SAR-Text, a large-scale SAR image–text dataset comprising over 130,000 pairs. The dataset is built with SAR-Narrator, a captioning framework that adapts vision-language foundation models to the SAR domain through a multi-stage progressive transfer learning strategy; as a flexible tool, it also enables community-driven dataset expansion. Models trained on SAR-Text (SAR-RS-CLIP, SAR-RS-CoCa, and SAR-GPT) show substantial gains: average recall in cross-modal retrieval improves by up to 16.43%; captioning scores on BLEU-4, SPICE, and CIDEr exceed the original CoCa baseline by more than 8×, 4×, and 10×, respectively; and on SAR visual question answering (SAR-VQA), SAR-GPT demonstrates markedly stronger semantic understanding and reasoning.
📝 Abstract
Vision-language models (VLMs) have achieved remarkable breakthroughs in remote sensing in recent years. Synthetic Aperture Radar (SAR) imagery, with its all-weather imaging capability, is essential to remote sensing, yet the lack of large-scale, high-quality SAR image-text datasets hinders its semantic understanding. In this paper, we construct SAR-Text, a large-scale, high-quality dataset of over 130,000 SAR image-text pairs. To build SAR-Text, we design the SAR-Narrator framework, which generates textual descriptions for SAR images through a multi-stage progressive transfer learning strategy. To verify the effectiveness of SAR-Text, we conduct experiments on three typical vision-language tasks: image-text retrieval, image captioning, and visual question answering (VQA). Specifically, we build three representative models on SAR-Text: SAR-RS-CLIP, SAR-RS-CoCa, and SAR-GPT. SAR-RS-CLIP achieves notable gains in retrieval, boosting average recall by 16.43% and 10.54% on the OSdataset-512 and HRSID test sets, respectively. In the captioning task, SAR-RS-CoCa attains BLEU-4, SPICE, and CIDEr scores exceeding those of the original CoCa model by more than 8x, 4x, and 10x, respectively. In the VQA task, SAR-GPT outperforms baseline and single-stage models on multiple SAR-VQA datasets, demonstrating stronger semantic understanding and reasoning ability, as qualitative results further confirm. Notably, as a flexible captioning tool, SAR-Narrator can be readily adopted by the community to construct larger-scale SAR image-text datasets.
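The "average recall" reported for the retrieval results is, in the standard cross-modal retrieval protocol, the mean of R@K (typically K = 1, 5, 10) over both retrieval directions, image→text and text→image. The paper does not spell out its exact protocol, so the sketch below is an illustrative implementation of that common convention, assuming a square CLIP-style similarity matrix in which row *i* and column *i* form the matching pair; all function names here are hypothetical:

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Fraction of queries (rows of `sim`) whose ground-truth match
    (the column with the same index) appears among the top-k columns
    ranked by similarity score."""
    topk = np.argsort(-sim, axis=1)[:, :k]          # indices of k highest-scoring columns per row
    hits = (topk == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

def average_recall(sim: np.ndarray, ks=(1, 5, 10)) -> float:
    """Mean of R@k over both directions: image->text uses `sim` as-is,
    text->image uses its transpose."""
    scores = [recall_at_k(sim, k) for k in ks] + [recall_at_k(sim.T, k) for k in ks]
    return float(np.mean(scores))
```

Under this convention, a 16.43% gain in average recall means the mean of these six R@K numbers rose by 16.43 percentage points over the baseline model on the same test set.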