Representation Discrepancy Bridging Method for Remote Sensing Image-Text Retrieval

📅 2025-05-22

🤖 AI Summary
This paper addresses cross-modal optimization imbalance in remote sensing image-text retrieval (RSITR): when vision-language pretraining (VLP) models are fine-tuned, the dominant text modality can suppress visual representation learning. It proposes a Cross-Modal Asymmetric Adapter (CMAA) and a Dual-Task Consistency Loss (DTCL), which together allow modality-specific, parameter-efficient optimization with more robust cross-modal alignment. The method combines differential attention, hierarchical attention, parameter-efficient fine-tuning (PEFT), joint dual-task optimization, and consistency regularization based on an exponential moving average. On the RSICD and RSITMD benchmarks, it improves mean Recall (mR) by 6%–11% over state-of-the-art PEFT methods and by 1.15%–2% over the fully fine-tuned GeoRSCLIP baseline.

📝 Abstract
Remote Sensing Image-Text Retrieval (RSITR) plays a critical role in geographic information interpretation, disaster monitoring, and urban planning by establishing semantic associations between images and textual descriptions. Existing Parameter-Efficient Fine-Tuning (PEFT) methods for Vision-and-Language Pre-training (VLP) models typically adopt symmetric adapter structures to explore cross-modal correlations. However, the strongly discriminative text modality may dominate the optimization process and inhibit image representation learning. This non-negligible imbalance in cross-modal optimization remains a bottleneck for improving model performance. To address this issue, this study proposes a Representation Discrepancy Bridging (RDB) method for the RSITR task. On the one hand, a Cross-Modal Asymmetric Adapter (CMAA) is designed to enable modality-specific optimization and improve feature alignment. The CMAA comprises a Visual Enhancement Adapter (VEA) and a Text Semantic Adapter (TSA). The VEA mines fine-grained image features through a Differential Attention (DA) mechanism, while the TSA identifies key textual semantics through a Hierarchical Attention (HA) mechanism. On the other hand, this study extends the traditional single-task retrieval framework to a dual-task optimization framework and develops a Dual-Task Consistency Loss (DTCL). The DTCL improves the robustness of cross-modal alignment through an adaptive weighted combination of cross-modal, classification, and exponential moving average consistency constraints. Experiments on the RSICD and RSITMD datasets show that the proposed RDB method improves the mR metric by 6%-11% compared with state-of-the-art PEFT methods and by 1.15%-2% compared with the fully fine-tuned GeoRSCLIP model.
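
For readers unfamiliar with the mR metric, the sketch below shows how it is conventionally computed in image-text retrieval: the average of R@1, R@5, and R@10 over both retrieval directions. It is an illustration only, not the authors' evaluation code. The function names are hypothetical, and it assumes a square similarity matrix in which row i is image i, column i is its single matching caption (RSICD and RSITMD actually pair each image with several captions, which this simplification ignores).

```python
def recall_at_k(sim, k):
    """Fraction of queries (rows of sim) whose same-index item ranks in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        # Rank = number of candidates scoring strictly higher than the ground truth.
        rank = sum(1 for s in row if s > row[i])
        hits += rank < k
    return hits / len(sim)


def mean_recall(sim, ks=(1, 5, 10)):
    """mR in percent: average of R@k over image-to-text and text-to-image."""
    # Transposing the matrix swaps the roles of query and candidate.
    transposed = [list(col) for col in zip(*sim)]
    scores = [recall_at_k(m, k) for m in (sim, transposed) for k in ks]
    return 100.0 * sum(scores) / len(scores)


# Toy 3x3 similarity matrix (made-up numbers): image 1 ranks its caption third.
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.2, 0.7],
]
toy_mr = mean_recall(toy_sim)
```

Because mR averages six recall values, a gain of several points in mR can come from a large gain at R@1 or from smaller gains spread across all cutoffs; the summary figure alone does not say which.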
Problem

Research questions and friction points this paper is trying to address.

Bridges representation discrepancy in remote sensing image-text retrieval
Addresses imbalanced cross-modal optimization in VLP models
Enhances feature alignment via asymmetric adapters and dual-task loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Modal Asymmetric Adapter for modality-specific optimization
Dual-Task Consistency Loss for robust alignment
Differential and Hierarchical Attention mechanisms for feature mining
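
The abstract describes the DTCL as a weighted combination of cross-modal, classification, and exponential moving average (EMA) consistency terms, without giving the formulas. The sketch below illustrates only the generic building blocks under stated assumptions: the standard EMA parameter update, a mean-squared consistency penalty (an assumed choice), and a normalized weighted sum. All function names, the decay value, and the weights are hypothetical; the paper's actual adaptive weighting rule is not reproduced here.

```python
def ema_update(ema_params, params, decay=0.99):
    """Standard EMA step: decay * ema + (1 - decay) * current, per parameter."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


def consistency_penalty(student_out, teacher_out):
    """Mean squared difference between student and EMA-teacher outputs.

    Assumption: the source does not state which distance the EMA term uses.
    """
    diffs = [(s - t) ** 2 for s, t in zip(student_out, teacher_out)]
    return sum(diffs) / len(diffs)


def combine_losses(losses, weights):
    """Weighted sum of named loss terms, with weights normalized to sum to 1.

    Assumption: fixed placeholder weights stand in for the adaptive scheme.
    """
    total = sum(weights.values())
    return sum(weights[name] / total * value for name, value in losses.items())


# Made-up loss values, for illustration only.
example_losses = {
    "cross_modal": 1.0,
    "classification": 3.0,
    "ema_consistency": consistency_penalty([1.0, 2.0], [1.0, 0.0]),
}
example_weights = {"cross_modal": 1.0, "classification": 1.0, "ema_consistency": 2.0}
example_total = combine_losses(example_losses, example_weights)
```

The EMA copy changes slowly, so penalizing disagreement with it discourages abrupt shifts in the learned representation during fine-tuning; how the paper balances this against the retrieval and classification terms is left to the full text.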

Authors
Hailong Ning
University of Illinois Urbana Champaign
Siying Wang
University of Electronic Science and Technology of China
Tao Lei
Shaanxi Joint Laboratory of Artificial Intelligence and the School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi’an 710021, China
Xiaopeng Cao
School of Computer Science and Technology, Xi’an University of Posts and Telecommunications, Xi’an 710121, China; Shaanxi Key Laboratory of Network Data Analysis and Intelligent Processing, Xi’an 710121, China; Xi’an Key Laboratory of Big Data and Intelligent Computing, Xi’an 710121, China
Huanmin Dou
School of Computer Engineering, Weifang University, Shandong 261061, China
Bin Zhao
Shanghai Artificial Intelligence Laboratory, Shanghai, 200003, China
Asoke K. Nandi
Department of Electronic and Electrical Engineering, Brunel University of London, Uxbridge, UB8 3PH, United Kingdom
Petia Radeva
Dept. Matemàtiques i Informàtica and Institute of Neuroscience, Universitat de Barcelona, Barcelona, 08007, Spain