🤖 AI Summary
To address hallucination issues in remote sensing vision-language foundation models, which stem from inadequate long-text modeling and the coarse-grained semantics of short captions, this paper introduces LRS2M, the first remote sensing image-text dataset supporting joint long- and short-text modeling, comprising 2 million multi-source annotated pairs. We further propose LRSCLIP, a Long-CLIP-based architecture that combines a dual-text loss weighting mechanism with Long-CLIP's Knowledge-Preserved Stretching (KPS) module, which extends the positional embeddings to accommodate long text, enabling fine-grained image-text alignment. This work establishes the first long-text-aware image-text alignment paradigm tailored for remote sensing. Experiments demonstrate: (1) a 10-20% improvement over the Long-CLIP baseline in zero-shot long-text retrieval; (2) higher short-text retrieval accuracy than GeoRSCLIP on RSITMD and RSICD, achieving state-of-the-art R@1 and mean Recall (mR); (3) new SOTA results in zero-shot image classification (75.75% average accuracy) and semantic localization (Rmi = 0.7653).
📝 Abstract
This study addresses two technical bottlenecks in remote sensing vision-language foundation models (VLFMs): limited long-text handling and the "hallucination" issue caused by insufficient information in short texts. We propose a novel vision-language foundation model, LRSCLIP, and a multimodal dataset, LRS2M. The main contributions are as follows: (1) By integrating multi-source remote sensing data and adopting a large language model labeling strategy, we construct the LRS2M dataset, which contains 2 million image-text pairs and, for the first time, provides both short and long texts, addressing the limited semantic granularity of existing datasets; (2) We design the LRSCLIP architecture, which builds on Long-CLIP's KPS module to extend CLIP's text processing capacity and achieves fine-grained cross-modal feature alignment through a dual-text loss weighting mechanism. Experimental results show that LRSCLIP improves retrieval accuracy by 10%-20% over the Long-CLIP baseline on the zero-shot long-text cross-modal retrieval task. On the zero-shot short-text cross-modal retrieval task, LRSCLIP outperforms the current best model, GeoRSCLIP, by 0.17%, 0.67%, and 0.92% in Text-to-Image R@1, Image-to-Text R@1, and mR on RSITMD, respectively, and by 0.04%, 2.93%, and 1.28% on RSICD. On the zero-shot image classification task (average accuracy = 75.75%) and the semantic localization task (Rmi = 0.7653), LRSCLIP achieves state-of-the-art performance. These results validate LRSCLIP's dual advantages of fine-grained semantic understanding and global feature matching. This work provides a new benchmark model and data support for remote sensing multimodal learning. The code has been open-sourced and is available at https://github.com/MitsuiChen14/LRSCLIP.
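As a rough illustration of the dual-text loss weighting mechanism described above, the sketch below combines a CLIP-style symmetric contrastive (InfoNCE) loss over image/long-text pairs with one over image/short-text pairs via a scalar weight. The function names, the weight `alpha`, and the temperature are illustrative assumptions, not the paper's exact formulation or reported hyperparameters.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over L2-normalized image/text embeddings (as in CLIP)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)               # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)           # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

def dual_text_loss(image_emb, long_text_emb, short_text_emb, alpha=0.8):
    """Weighted sum of long-text and short-text contrastive losses.

    `alpha` balances the two text granularities; its value here is
    illustrative, not the weight used by LRSCLIP.
    """
    loss_long = clip_contrastive_loss(image_emb, long_text_emb)
    loss_short = clip_contrastive_loss(image_emb, short_text_emb)
    return alpha * loss_long + (1.0 - alpha) * loss_short

# Toy usage: random embeddings stand in for the image and text encoder outputs.
if __name__ == "__main__":
    B, D = 8, 512
    img = torch.randn(B, D)
    long_txt = torch.randn(B, D)
    short_txt = torch.randn(B, D)
    print(dual_text_loss(img, long_txt, short_txt).item())
```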