Multi-Perspective Subimage CLIP with Keyword Guidance for Remote Sensing Image-Text Retrieval

📅 2026-01-26
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of existing remote sensing image-text retrieval methods, which rely on coarse-grained global alignment and overlook the dense, multi-scale semantic content of overhead imagery, while full fine-tuning incurs high computational costs and risks catastrophic forgetting. To overcome these issues, we propose MPS-CLIP, a novel framework featuring a keyword-guided multi-view subregion alignment mechanism. Specifically, keywords generated by a large language model guide SamGeo to segment semantically meaningful subregions; fine-grained alignment is then achieved through a Gated Global Attention (G²A) adapter and a dynamic maximum-response view selection strategy. The model employs lightweight adapters and is jointly optimized with hybrid contrastive and weighted triplet losses. Experiments show that MPS-CLIP achieves mean recall (mR) of 35.18% on RSICD and 48.40% on RSITMD, significantly outperforming full fine-tuning baselines and recent state-of-the-art methods.
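The maximum-response view selection described above can be sketched in a few lines: each image contributes several perspective (subimage) embeddings, the image-text score keeps only the best-responding view, and the scores feed a symmetric InfoNCE contrastive loss. This is a minimal numpy illustration under stated assumptions — the function names, temperature value, and exact scoring are not taken from the paper's implementation.

```python
import numpy as np

def max_response_similarity(view_embs, text_emb):
    """Score one image against one text: cosine similarity per
    perspective embedding, keep only the maximum response.
    view_embs: (K, D) subimage embeddings; text_emb: (D,)."""
    v = view_embs / np.linalg.norm(view_embs, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    return float(np.max(v @ t))

def multi_perspective_contrastive(batch_views, batch_texts, temperature=0.07):
    """Symmetric InfoNCE over a batch using max-response scores.
    batch_views: (B, K, D) perspective embeddings; batch_texts: (B, D).
    Matched image-text pairs share the same batch index."""
    B = batch_views.shape[0]
    sims = np.array([[max_response_similarity(batch_views[i], batch_texts[j])
                      for j in range(B)] for i in range(B)]) / temperature

    def ce(logits):
        # cross-entropy with the matching pair on the diagonal
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average of image-to-text and text-to-image directions
    return 0.5 * (ce(sims) + ce(sims.T))
```

Because only the single best-matching perspective contributes to each score, views irrelevant to the caption are suppressed rather than averaged in — which is the stated motivation for max-response selection.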

šŸ“ Abstract
Vision-Language Pre-training (VLP) models like CLIP have significantly advanced Remote Sensing Image-Text Retrieval (RSITR). However, existing methods predominantly rely on coarse-grained global alignment, which often overlooks the dense, multi-scale semantics inherent in overhead imagery. Moreover, adapting these heavy models via full fine-tuning incurs prohibitive computational costs and risks catastrophic forgetting. To address these challenges, we propose MPS-CLIP, a parameter-efficient framework designed to shift the retrieval paradigm from global matching to keyword-guided fine-grained alignment. Specifically, we leverage a Large Language Model (LLM) to extract core semantic keywords, guiding the Segment Anything Model (SamGeo) to generate semantically relevant sub-perspectives. To efficiently adapt the frozen backbone, we introduce a Gated Global Attention (G²A) adapter, which captures global context and long-range dependencies with minimal overhead. Furthermore, a Multi-Perspective Representation (MPR) module aggregates these local cues into robust multi-perspective embeddings. The framework is optimized via a hybrid objective combining multi-perspective contrastive and weighted triplet losses, which dynamically selects maximum-response perspectives to suppress noise and enforce precise semantic matching. Extensive experiments on the RSICD and RSITMD benchmarks demonstrate that MPS-CLIP achieves state-of-the-art performance with 35.18% and 48.40% mean Recall (mR), respectively, significantly outperforming full fine-tuning baselines and recent competitive methods. Code is available at https://github.com/Lcrucial1f/MPS-CLIP.
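The weighted triplet component of the hybrid objective can be illustrated with a small sketch. This is a hedged reading: it assumes matched pairs sit on the diagonal of a batch similarity matrix and that each margin-violating negative is weighted by the severity of its violation (a common weighting scheme for hard negatives); MPS-CLIP's exact weighting may differ.

```python
import numpy as np

def weighted_triplet_loss(sim, margin=0.2):
    """Bidirectional triplet loss over a batch similarity matrix.
    sim: (B, B) image-text similarities, matched pairs on the diagonal.
    Each negative that violates the margin is weighted by a softmax over
    violation magnitudes, so harder negatives contribute more."""
    B = sim.shape[0]
    pos = np.diag(sim)
    loss = 0.0
    for i in range(B):
        # image -> text: negatives are the other captions in the batch
        neg = np.delete(sim[i], i)
        viol = np.maximum(0.0, margin + neg - pos[i])
        if viol.max() > 0:
            w = np.exp(viol) / np.exp(viol).sum()
            loss += float((w * viol).sum())
        # text -> image: negatives are the other images in the batch
        neg_t = np.delete(sim[:, i], i)
        viol_t = np.maximum(0.0, margin + neg_t - pos[i])
        if viol_t.max() > 0:
            w_t = np.exp(viol_t) / np.exp(viol_t).sum()
            loss += float((w_t * viol_t).sum())
    return loss / B
```

When every positive already beats all negatives by the margin, the loss is exactly zero, so the triplet term only pushes on pairs the contrastive term has not yet separated.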
Problem

Research questions and friction points this paper is trying to address.

Remote Sensing Image-Text Retrieval
Coarse-grained Global Alignment
Multi-scale Semantics
Parameter-Efficient Adaptation
Catastrophic Forgetting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Keyword-Guided Alignment
Parameter-Efficient Adaptation
Multi-Perspective Representation
Gated Global Attention
Vision-Language Pre-training
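As a rough illustration of the Gated Global Attention idea listed above — a lightweight adapter that modulates frozen backbone features using globally pooled context — the following sketch shows one plausible shape. All parameter names, the mean-pooling choice, and the scalar sigmoid gate are assumptions for illustration only; the paper's actual G²A parameterisation may differ.

```python
import numpy as np

def g2a_adapter(x, W_down, W_up, gate_w):
    """Hypothetical gated bottleneck adapter over frozen token features.
    x: (N, D) token features; W_down: (D, d); W_up: (d, D); gate_w: (d,).
    The residual update is scaled by a gate computed from global context."""
    h = np.maximum(0.0, x @ W_down)              # down-project + ReLU bottleneck
    ctx = h.mean(axis=0)                         # global context pooling over tokens
    gate = 1.0 / (1.0 + np.exp(-(ctx @ gate_w))) # sigmoid gate in [0, 1]
    return x + gate * (h @ W_up)                 # gated residual update
```

Only the small matrices `W_down`, `W_up`, and `gate_w` would be trained, which matches the parameter-efficient adaptation theme: the frozen features pass through unchanged whenever the gate closes.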
🔎 Similar Papers
No similar papers found.