TAR: Text Semantic Assisted Cross-modal Image Registration Framework for Optical and SAR Images

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the registration of optical and synthetic aperture radar (SAR) images under large geometric deformations, where the model must handle cross-modal appearance discrepancies and complex spatial transformations at the same time. To narrow the modality gap, the authors propose the TAR framework, which uses a frozen RemoteCLIP text encoder to extract textual semantic priors about remote sensing scenes and land-cover categories. TAR has three components: multi-scale visual feature learning (MSFL), text-assisted feature enhancement (TAFE), and coarse-to-fine dense matching (CFDM). The text features enhance the high-level visual features through visual-text interaction before coarse matching. In experiments on cross-modal remote sensing images, TAR achieves stronger matching performance than several state-of-the-art methods, with the largest gains under large geometric deformations.
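The summary does not say how the visual-text interaction is implemented. As a rough illustration only, the sketch below assumes a residual cross-attention step in which each high-level visual feature attends over a small set of text features. The function names (`softmax`, `text_enhance`) and the toy vectors are hypothetical; real text features would come from the frozen RemoteCLIP text encoder, which is not called here.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def text_enhance(visual, text):
    """Residual cross-attention (an assumed form of visual-text interaction).

    visual: list of N visual feature vectors (queries)
    text:   list of K text feature vectors (keys and values)
    Returns N vectors: visual feature + attention-weighted text context.
    """
    d = len(text[0])
    enhanced = []
    for v in visual:
        # Scaled dot-product score between this visual feature and each text feature.
        scores = [sum(a * b for a, b in zip(v, t)) / math.sqrt(d) for t in text]
        weights = softmax(scores)
        # Weighted sum of text features, added back to the visual feature.
        context = [sum(w * t[i] for w, t in zip(weights, text)) for i in range(d)]
        enhanced.append([a + c for a, c in zip(v, context)])
    return enhanced

# Toy placeholders: two "text descriptors" and three visual locations.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
visual_feats = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
enhanced_feats = text_enhance(visual_feats, text_feats)
```

In this toy setup a visual feature aligned with one text descriptor is pulled further toward it, while a feature equidistant from both receives an equal mix of the two.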

📝 Abstract

Existing deep learning-based methods can capture shared features from optical and synthetic aperture radar (SAR) images for spatial alignment. However, optical-SAR registration remains challenging under large geometric deformations, because the model needs to simultaneously handle cross-modal appearance discrepancies and complex spatial transformations. To address this issue, this paper proposes a text semantic-assisted cross-modal image registration framework, named TAR, for optical and SAR images. TAR exploits text semantic priors from remote sensing scenes and land-cover categories to alleviate the modality gap and enhance cross-modal feature learning. TAR consists of three components: a multi-scale visual feature learning (MSFL) module, a text-assisted feature enhancement (TAFE) module, and a coarse-to-fine dense matching (CFDM) module. MSFL extracts multi-scale visual features from optical and SAR images. TAFE constructs text descriptors related to remote sensing scenes and land-cover objects, and uses a frozen RemoteCLIP text encoder to extract text features. These text features are introduced through visual-text interaction to enhance high-level visual features for more reliable coarse matching. CFDM then establishes coarse correspondences based on the enhanced high-level features and refines the matched locations using low-level features. Experimental results on cross-modal remote sensing images demonstrate the effectiveness of TAR, which achieves stronger matching performance than several state-of-the-art methods and yields significant gains under large geometric deformations.
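The coarse-to-fine idea behind CFDM can be illustrated with a small sketch. It is not the authors' implementation: it assumes mutual nearest-neighbour selection for the coarse stage and a best-match search inside a local window for the fine stage, which are common choices in dense matchers. All names (`coarse_match`, `refine`) and the toy features are hypothetical.

```python
def dot(a, b):
    """Dot-product similarity between two feature vectors."""
    return sum(x * y for x, y in zip(a, b))

def coarse_match(feats_a, feats_b):
    """Keep pairs (i, j) that are each other's nearest neighbour (assumed criterion)."""
    sim = [[dot(a, b) for b in feats_b] for a in feats_a]
    best_b = [max(range(len(feats_b)), key=lambda j: sim[i][j])
              for i in range(len(feats_a))]
    best_a = [max(range(len(feats_a)), key=lambda i: sim[i][j])
              for j in range(len(feats_b))]
    return [(i, j) for i, j in enumerate(best_b) if best_a[j] == i]

def refine(query, window):
    """Return the fine position in a local window whose low-level
    feature is most similar to the query feature."""
    return max(window, key=lambda pos: dot(query, window[pos]))

# Toy high-level features: three optical cells, two SAR cells.
coarse_opt = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
coarse_sar = [[0.0, 1.0], [1.0, 0.0]]
matches = coarse_match(coarse_opt, coarse_sar)

# Toy low-level features in a 2x2 window around one coarse match.
query = [1.0, 0.0]
window = {
    (0, 0): [0.2, 0.9],
    (0, 1): [0.9, 0.1],
    (1, 0): [0.5, 0.5],
    (1, 1): [0.1, 0.3],
}
fine_pos = refine(query, window)
```

Here the third optical cell is discarded because neither SAR cell picks it as its best match, and the refinement step selects the window position most similar to the query.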

Problem

Research questions and friction points this paper is trying to address.

cross-modal registration

optical-SAR images

large geometric deformations

modality gap

image alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

text semantic prior

cross-modal registration

optical-SAR image alignment