PromptMID: Modal Invariant Descriptors Based on Diffusion and Vision Foundation Models for Optical-SAR Image Matching

📅 2025-02-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address weak generalization and high domain adaptation costs in optical-SAR cross-modal image matching, this paper proposes a fine-tuning-free zero-shot matching method leveraging land-use semantic priors. We design a text-prompt-guided, modality-invariant descriptor construction framework. To our knowledge, this is the first work to jointly exploit diffusion models (Stable Diffusion) and vision foundation models (ViT/CLIP) for cross-modal feature alignment. Interpretable semantic text prompts drive modality-agnostic representation learning, while a multi-granularity feature aggregation module enhances cross-domain robustness. Evaluated on four heterogeneous regional datasets, our method achieves over 12% improvement in unseen-domain matching mAP, significantly outperforming state-of-the-art approaches. It demonstrates strong cross-domain generalization and plug-and-play zero-shot deployability without domain-specific adaptation.

Technology Category

Application Category

📝 Abstract
The ideal goal of image matching is to achieve stable and efficient performance in unseen domains. However, many existing learning-based optical-SAR image matching methods, despite their effectiveness in specific scenarios, exhibit limited generalization and struggle to adapt to practical applications. Repeatedly training or fine-tuning matching models to address domain differences is not only not elegant enough but also introduces additional computational overhead and data production costs. In recent years, general foundation models have shown great potential for enhancing generalization. However, the disparity in visual domains between natural and remote sensing images poses challenges for their direct application. Therefore, effectively leveraging foundation models to improve the generalization of optical-SAR image matching remains challenge. To address the above challenges, we propose PromptMID, a novel approach that constructs modality-invariant descriptors using text prompts based on land use classification as priors information for optical and SAR image matching. PromptMID extracts multi-scale modality-invariant features by leveraging pre-trained diffusion models and visual foundation models (VFMs), while specially designed feature aggregation modules effectively fuse features across different granularities. Extensive experiments on optical-SAR image datasets from four diverse regions demonstrate that PromptMID outperforms state-of-the-art matching methods, achieving superior results in both seen and unseen domains and exhibiting strong cross-domain generalization capabilities. The source code will be made publicly available https://github.com/HanNieWHU/PromptMID.
Problem

Research questions and friction points this paper is trying to address.

Improves optical-SAR image matching generalization
Reduces domain adaptation computational overhead
Leverages foundation models for cross-domain efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses text prompts for descriptors
Leverages diffusion and visual models
Features multi-scale invariant aggregation
🔎 Similar Papers
No similar papers found.
H
Han Nie
State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China
B
Bin Luo
State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China
J
Jun Liu
State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China
Z
Zhitao Fu
Faculty of Land Resources Engineering, Kunming University of Science and Technology, Kunming 650031, China
Huan Zhou
Huan Zhou
Northwestern Polytechnical University
Mobile Edge ComputingFederated LearningMobile Social NetworksVANETsData Offloading
S
Shuo Zhang
State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China
W
Weixing Liu
State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China