Representation Discrepancy Bridging Method for Remote Sensing Image-Text Retrieval

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses cross-modal optimization imbalance in remote sensing image-text retrieval (RSITR), where the dominance of the text modality suppresses visual representation learning when vision-language pretraining (VLP) models are fine-tuned. It proposes a cross-modal asymmetric adapter architecture and a dual-task consistency loss, enabling, for the first time in remote sensing VLP fine-tuning, modality-specific, parameter-efficient optimization with robust cross-modal alignment. The method integrates differential attention, hierarchical attention, parameter-efficient fine-tuning (PEFT), joint dual-task optimization, and consistency regularization based on an exponential moving average. Evaluated on the RSICD and RSITMD benchmarks, it improves mean Recall (mR) by 6%–11% over state-of-the-art PEFT methods and surpasses the full-parameter fine-tuning baseline GeoRSCLIP by 1.15%–2.0%, while updating only a small fraction of the parameters.

📝 Abstract
Remote Sensing Image-Text Retrieval (RSITR) plays a critical role in geographic information interpretation, disaster monitoring, and urban planning by establishing semantic associations between images and textual descriptions. Existing Parameter-Efficient Fine-Tuning (PEFT) methods for Vision-and-Language Pre-training (VLP) models typically adopt symmetric adapter structures to explore cross-modal correlations. However, the strongly discriminative nature of the text modality may dominate the optimization process and inhibit image representation learning. This non-negligible imbalance in cross-modal optimization remains a bottleneck to improving model performance. To address this issue, this study proposes a Representation Discrepancy Bridging (RDB) method for the RSITR task. On the one hand, a Cross-Modal Asymmetric Adapter (CMAA) is designed to enable modality-specific optimization and improve feature alignment. The CMAA comprises a Visual Enhancement Adapter (VEA) and a Text Semantic Adapter (TSA). The VEA mines fine-grained image features through a Differential Attention (DA) mechanism, while the TSA identifies key textual semantics through a Hierarchical Attention (HA) mechanism. On the other hand, this study extends the traditional single-task retrieval framework to a dual-task optimization framework and develops a Dual-Task Consistency Loss (DTCL). The DTCL improves the robustness of cross-modal alignment through an adaptive weighted combination of cross-modal, classification, and exponential moving average consistency constraints. Experiments on the RSICD and RSITMD datasets show that the proposed RDB method achieves a 6%-11% improvement in the mR metric compared to state-of-the-art PEFT methods and a 1.15%-2% improvement over the fully fine-tuned GeoRSCLIP model.
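
The mR figures quoted above follow, in the usual RSITR convention, from averaging R@1, R@5, and R@10 over both image-to-text and text-to-image retrieval. The sketch below illustrates that computation under a simplifying assumption that is not taken from the paper: each query has exactly one correct match, located at the same index. RSICD and RSITMD pair each image with several captions, so a real evaluation script would need a ground-truth mapping instead. The function names `recall_at_k` and `mean_recall` are hypothetical.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose true match ranks within the top k.

    sim[i][j] is the similarity of query i to candidate j. Assumption for
    this sketch: candidate i is the single correct match for query i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank = number of candidates scoring strictly higher than the true match.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


def mean_recall(sim_i2t, ks=(1, 5, 10)):
    """Average R@K over both retrieval directions (the mR metric)."""
    # Text-to-image similarities are the transpose of image-to-text ones.
    sim_t2i = [list(col) for col in zip(*sim_i2t)]
    scores = [recall_at_k(s, k) for s in (sim_i2t, sim_t2i) for k in ks]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy 3x3 similarity matrix; the values are made up for illustration.
    toy = [[0.9, 0.1, 0.2],
           [0.8, 0.3, 0.1],
           [0.1, 0.2, 0.7]]
    print(mean_recall(toy, ks=(1, 2)))
```

Because mR averages six recall values, an improvement of a few points can come from gains concentrated in one direction or one cutoff, so the per-direction R@K values are worth inspecting alongside it.
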
Problem

Research questions and friction points this paper is trying to address.

Bridges representation discrepancy in remote sensing image-text retrieval
Addresses imbalanced cross-modal optimization in VLP models
Enhances feature alignment via asymmetric adapters and dual-task loss
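
The dual-task loss mentioned above combines cross-modal, classification, and exponential moving average (EMA) consistency terms. The following is only a rough sketch of how such a combination could look, not the paper's formulation: it assumes a symmetric InfoNCE retrieval term, a cross-entropy classification term, a mean-squared-error consistency term against an EMA teacher, and fixed weights in place of the adaptive weighting the abstract describes. All function names, the temperature, and the decay value are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one logit vector against an integer class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over a square image-text similarity matrix.

    Assumes sim[i][i] is the matched pair for image i and text i.
    """
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    transposed = [list(col) for col in zip(*scaled)]
    i2t = sum(cross_entropy(scaled[i], i) for i in range(n)) / n
    t2i = sum(cross_entropy(transposed[i], i) for i in range(n)) / n
    return 0.5 * (i2t + t2i)


def ema_update(teacher, student, decay=0.99):
    """Move teacher parameters toward the student by a factor (1 - decay)."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def consistency_loss(student_out, teacher_out):
    """Mean squared error between student and EMA-teacher outputs."""
    diffs = [(s - t) ** 2 for s, t in zip(student_out, teacher_out)]
    return sum(diffs) / len(diffs)


def dual_task_loss(sim, class_logits, labels, student_out, teacher_out,
                   weights=(1.0, 1.0, 1.0)):
    """Weighted sum of retrieval, classification, and consistency terms.

    The weights are fixed here; the paper describes an adaptive scheme.
    """
    l_ret = contrastive_loss(sim)
    l_cls = sum(cross_entropy(z, y)
                for z, y in zip(class_logits, labels)) / len(labels)
    l_con = consistency_loss(student_out, teacher_out)
    w_ret, w_cls, w_con = weights
    return w_ret * l_ret + w_cls * l_cls + w_con * l_con
```

In this sketch the teacher is never trained directly: after each optimizer step on the student, `ema_update` would be called on the parameters, so the consistency term penalizes abrupt changes in the student's outputs.
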
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Modal Asymmetric Adapter for modality-specific optimization
Dual-Task Consistency Loss for robust alignment
Differential and Hierarchical Attention mechanisms for feature mining
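
For intuition about the differential attention listed above, one widely cited formulation (from the Differential Transformer line of work) computes two softmax attention maps and subtracts a scaled copy of the second from the first before weighting the values, which cancels attention mass that both maps assign to uninformative positions. Whether the paper's Visual Enhancement Adapter uses exactly this form is an assumption; the sketch below, with hypothetical names and a fixed `lam` in place of a learned scalar, only illustrates the general idea.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(q, k):
    """Row-wise softmax of scaled dot products between queries and keys."""
    scale = 1.0 / math.sqrt(len(q[0]))
    return [softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
            for qi in q]


def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Apply (A1 - lam * A2) to the values, where A1 and A2 are softmax maps.

    q1, k1 and q2, k2 are two separate query/key projections of the same
    tokens; lam is fixed here, although it is typically a learned scalar.
    """
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    out = []
    for r1, r2 in zip(a1, a2):
        # Differential weights can be negative and need not sum to one.
        weights = [x - lam * y for x, y in zip(r1, r2)]
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out
```

When the two maps agree everywhere and `lam` is 1, the output vanishes, so only the patterns that distinguish the first map from the second contribute to the result.
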
Hailong Ning
University of Illinois Urbana-Champaign
batteries, energy storage, energy harvesting, foam
Siying Wang
University of Electronic Science and Technology of China
reinforcement learning, multi-agent reinforcement learning, offline-to-online reinforcement learning
Tao Lei
Shaanxi Joint Laboratory of Artificial Intelligence and the School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi’an 710021, China
Xiaopeng Cao
School of Computer Science and Technology, Xi’an University of Posts and Telecommunications, Xi’an 710121, China; Shaanxi Key Laboratory of Network Data Analysis and Intelligent Processing, Xi’an 710121, China; Xi’an Key Laboratory of Big Data and Intelligent Computing, Xi’an 710121, China
Huanmin Dou
School of Computer Engineering, Weifang University, Shandong 261061, China
Bin Zhao
Shanghai Artificial Intelligence Laboratory, Shanghai, 200003, China
Asoke K. Nandi
Department of Electronic and Electrical Engineering, Brunel University of London, Uxbridge, UB8 3PH, United Kingdom
Petia Radeva
Dept. Matemàtiques i Informàtica and Institute of Neuroscience, Universitat de Barcelona, Barcelona, 08007, Spain