LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text

📅 2025-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address hallucination issues in remote sensing vision-language foundation models, which stem from inadequate long-text modeling and the coarse-grained semantics of short texts, this paper introduces LRS2M, the first remote sensing image-text dataset supporting joint long- and short-text modeling, comprising 2 million multi-source annotated pairs. It further proposes LRSCLIP, a Long-CLIP-based architecture featuring a dual-text loss weighting mechanism and Long-CLIP's knowledge-preserved stretching (KPS) of positional embeddings to enable fine-grained image-text alignment over longer inputs. This work establishes the first long-text-aware image-text alignment paradigm tailored for remote sensing. Experiments demonstrate: (1) a 10-20% improvement in zero-shot long-text retrieval performance over the Long-CLIP baseline; (2) superior short-text retrieval accuracy over GeoRSCLIP on RSITMD and RSICD, achieving state-of-the-art R@1 and mean Recall (mR); (3) new SOTA results in zero-shot image classification (75.75% average accuracy) and semantic localization (R<sub>mi</sub> = 0.7653).
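The summary above mentions Long-CLIP's knowledge-preserved stretching (KPS) of positional embeddings, which is how the text encoder's context window is extended beyond CLIP's 77 tokens. Below is a minimal sketch of that idea, assuming the Long-CLIP defaults (the first 20 positions are kept verbatim and the remainder is interpolated up to a 248-token window); the function name and exact lengths are illustrative assumptions, not taken from the LRSCLIP code.

```python
import torch
import torch.nn.functional as F

def stretch_positional_embedding(pos_embed: torch.Tensor,
                                 keep: int = 20,
                                 target_len: int = 248) -> torch.Tensor:
    """Knowledge-preserved stretching of positional embeddings (sketch).

    The first `keep` positions, which carry most of CLIP's trained
    positional knowledge, are copied unchanged; the remaining positions
    are linearly interpolated to fill the extended context window.
    `pos_embed` has shape (orig_len, dim), e.g. (77, 512) for CLIP.
    """
    kept = pos_embed[:keep]                      # (keep, dim), untouched
    tail = pos_embed[keep:]                      # (orig_len - keep, dim)
    # Interpolate the tail along the sequence axis to the new length.
    tail = tail.t().unsqueeze(0)                 # (1, dim, orig_len - keep)
    tail = F.interpolate(tail, size=target_len - keep,
                         mode="linear", align_corners=True)
    tail = tail.squeeze(0).t()                   # (target_len - keep, dim)
    return torch.cat([kept, tail], dim=0)        # (target_len, dim)
```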

📝 Abstract
This study addresses the technical bottlenecks in handling long text and the "hallucination" issue caused by insufficient short-text information in remote sensing vision-language foundation models (VLFMs). We propose a novel vision-language foundation model, LRSCLIP, and a multimodal dataset, LRS2M. The main contributions are as follows: (1) By integrating multi-source remote sensing data and adopting a large language model labeling strategy, we construct the LRS2M dataset, which contains 2 million image-text pairs and, for the first time, provides both short and long texts, thus overcoming the semantic granularity limitations of existing datasets; (2) We design the LRSCLIP architecture based on Long-CLIP's KPS module, which extends CLIP's text processing capacity and achieves fine-grained cross-modal feature alignment through a dual-text loss weighting mechanism. Experimental results show that LRSCLIP improves retrieval accuracy by 10%-20% over the Long-CLIP baseline in the zero-shot long-text cross-modal retrieval task. For the zero-shot short-text cross-modal retrieval task, LRSCLIP achieves improvements over the current best model, GeoRSCLIP, with gains of 0.17%, 0.67%, and 0.92% in Text-to-Image R@1, Image-to-Text R@1, and mR on RSITMD, respectively, and of 0.04%, 2.93%, and 1.28% on RSICD. In the zero-shot image classification task (average accuracy = 75.75%) and the semantic localization task (Rmi = 0.7653), LRSCLIP achieves state-of-the-art performance. These results validate LRSCLIP's dual advantages of fine-grained semantic understanding and global feature matching. This work provides a new benchmark model and data support for remote sensing multimodal learning. The code has been open-sourced and is available at https://github.com/MitsuiChen14/LRSCLIP.
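The dual-text loss weighting mechanism described above pairs each image with both a long and a short caption and balances the two alignment signals. A minimal sketch follows, assuming a standard CLIP-style symmetric InfoNCE objective; the weight `alpha`, the function names, and the feature interfaces are hypothetical, since the abstract does not give the exact formulation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_feat, txt_feat, logit_scale):
    """Symmetric InfoNCE loss over a batch of paired image/text features."""
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)
    logits = logit_scale * img_feat @ txt_feat.t()   # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def dual_text_loss(img_feat, long_txt_feat, short_txt_feat,
                   logit_scale, alpha=0.8):
    """Weighted combination of long-text and short-text alignment losses.

    `alpha` balances fine-grained long-text alignment against the
    coarse-grained short-text signal; its value here is illustrative only.
    """
    loss_long = clip_contrastive_loss(img_feat, long_txt_feat, logit_scale)
    loss_short = clip_contrastive_loss(img_feat, short_txt_feat, logit_scale)
    return alpha * loss_long + (1.0 - alpha) * loss_short
```

Weighting the long-text term more heavily would reflect the paper's emphasis on fine-grained alignment, but the actual ratio used by LRSCLIP is not stated in this summary.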
Problem

Research questions and friction points this paper is trying to address.

Aligning remote sensing images with long text descriptions
Reducing hallucination from insufficient short text information
Improving cross-modal retrieval accuracy in vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

LRSCLIP model extends CLIP for long-text alignment
LRS2M dataset includes 2M image-text pairs
Dual-text loss weighting enhances cross-modal retrieval
Weizhi Chen
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100101, China, and also with School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
Jingbo Chen
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100101, China
Yupeng Deng
Aerospace Information Research Institute, Chinese Academy of Sciences (AIRCAS)
Jiansheng Chen
School of Computer and Communication Engineering, University of Science and Technology Beijing
Yuman Feng
School of Information Network Security, People’s Public Security University of China, Beijing 100038, China
Zhihao Xi
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100101, China
Diyou Liu
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100101, China
Kai Li
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100101, China, and also with School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
Yu Meng
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100101, China