🤖 AI Summary
Existing cross-modal foundation models are predominantly designed for RGB imagery and struggle to generalize to synthetic aperture radar (SAR) remote sensing data, hindering all-weather scene understanding. To address this, we propose SAR-KnowLIP, the first general-purpose multimodal foundation model tailored for SAR. We introduce geographic information as a core inductive bias and construct SAR-GEOVL-1M, the first large-scale SAR multimodal dataset with complete geographic projection properties, spanning multiple satellite platforms, 120,000 images, and 135 cities. A hierarchical cognitive chain-of-thought (HCoT) pipeline generates more than one million structured textual annotations covering landforms, regional functions, target attributes, and spatial relationships. SAR-KnowLIP combines contrastive, matching, and reconstruction learning with a Self-Consistent Iterative Optimization mechanism, enabling closed-loop self-supervised training on a transferable multimodal encoder. Evaluated on a unified benchmark of 11 downstream vision and vision-language tasks against 14 state-of-the-art foundation models, SAR-KnowLIP achieves leading performance, particularly in object counting and land-cover classification. We further release this benchmark as the first unified multimodal evaluation suite for SAR.
📝 Abstract
Cross-modal artificial intelligence has garnered widespread attention in recent years, achieving significant progress in the study of natural images. However, existing methods are mostly designed for RGB imagery, leaving a substantial gap in modeling synthetic aperture radar (SAR) imagery. SAR, with its all-day, all-weather imaging capability, plays an irreplaceable role in remote sensing scene understanding. To address this gap, this paper proposes SAR-KnowLIP, the first universal SAR multimodal foundation model, along with reusable data and evaluation baselines. Specifically: (1) This work introduces the critical yet long-overlooked attribute of geographic information into remote sensing research, constructing SAR-GEOVL-1M (the first large-scale SAR dataset with complete geographic projection properties), covering multiple satellite platforms, 120,000 images, and 135 cities. (2) Aligned structured text is generated through a hierarchical cognitive chain-of-thought (HCoT), providing more than one million multi-dimensional semantic annotations of landforms, regional functions, target attributes, and spatial relationships. (3) We design a Self-Consistent Iterative Optimization mechanism that continuously enhances cross-modal alignment through a self-supervised closed loop of contrastive, matching, and reconstruction learning on a transferable multimodal encoder. (4) A unified evaluation benchmark is established across 11 representative downstream vision and vision-language tasks, with comparisons against 14 leading foundation models, where SAR-KnowLIP demonstrates leading performance, particularly in object counting and land-cover classification. We expect that SAR-KnowLIP's large-scale multimodal data, transferable model architecture, and comprehensive experimental benchmark will significantly advance the development of SAR multimodal baseline models.
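To make the training objective described in point (3) concrete, the sketch below shows, in rough form, how a joint contrastive + image-text matching + masked reconstruction loss could be combined in a self-consistent iterative loop. This is a minimal illustration under our own assumptions: the toy encoder, loss weights, masking ratio, negative-sampling rule, and update schedule are hypothetical and are not the authors' released implementation.

```python
# Minimal sketch (illustrative assumptions, not the authors' code):
# a joint contrastive + matching + reconstruction objective on a
# shared multimodal encoder, iterated in a simple closed loop.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMultimodalEncoder(nn.Module):
    """Stand-in for the transferable multimodal encoder (hypothetical)."""
    def __init__(self, img_dim=1024, txt_dim=768, emb_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)    # SAR image branch
        self.txt_proj = nn.Linear(txt_dim, emb_dim)    # text branch
        self.match_head = nn.Linear(2 * emb_dim, 2)    # aligned vs. mismatched pair
        self.recon_head = nn.Linear(emb_dim, img_dim)  # masked-feature reconstruction
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~ln(1/0.07), CLIP-style

    def forward(self, img_feat, txt_feat):
        z_i = F.normalize(self.img_proj(img_feat), dim=-1)
        z_t = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return z_i, z_t

def joint_loss(model, img_feat, txt_feat):
    z_i, z_t = model(img_feat, txt_feat)
    # (1) Contrastive loss: matched image-text pairs lie on the diagonal.
    logits = model.logit_scale.exp() * z_i @ z_t.t()
    targets = torch.arange(z_i.size(0))
    l_con = 0.5 * (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets))
    # (2) Matching loss: classify aligned pairs vs. shuffled (mismatched) pairs.
    neg_t = z_t.roll(shifts=1, dims=0)
    pairs = torch.cat([torch.cat([z_i, z_t], dim=-1),
                       torch.cat([z_i, neg_t], dim=-1)], dim=0)
    labels = torch.cat([torch.ones(z_i.size(0)),
                        torch.zeros(z_i.size(0))]).long()
    l_match = F.cross_entropy(model.match_head(pairs), labels)
    # (3) Reconstruction loss: recover full image features from masked input.
    mask = (torch.rand_like(img_feat) > 0.25).float()   # illustrative 25% masking
    recon = model.recon_head(model.img_proj(img_feat * mask))
    l_rec = F.mse_loss(recon, img_feat)
    return l_con + l_match + l_rec                       # equal weights, for illustration

# Self-consistent iteration (illustrative): repeatedly optimize the joint
# loss so that the alignment the model produces feeds back into training.
model = ToyMultimodalEncoder()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
img_feat, txt_feat = torch.randn(8, 1024), torch.randn(8, 768)  # dummy features
for step in range(3):
    loss = joint_loss(model, img_feat, txt_feat)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In practice the paper's mechanism would operate on real SAR image and HCoT text encodings rather than random features, and the closed loop would refresh cross-modal alignments between iterations; the snippet only illustrates how the three loss terms can share one encoder.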