PromptMID: Modal Invariant Descriptors Based on Diffusion and Vision Foundation Models for Optical-SAR Image Matching

📅 2025-02-25

📈 Citations: 0

✨ Influential: 0

career value

253K/year

🤖 AI Summary

To address weak generalization and high domain adaptation costs in optical-SAR cross-modal image matching, this paper proposes a fine-tuning-free zero-shot matching method leveraging land-use semantic priors. We design a text-prompt-guided, modality-invariant descriptor construction framework. To our knowledge, this is the first work to jointly exploit diffusion models (Stable Diffusion) and vision foundation models (ViT/CLIP) for cross-modal feature alignment. Interpretable semantic text prompts drive modality-agnostic representation learning, while a multi-granularity feature aggregation module enhances cross-domain robustness. Evaluated on four heterogeneous regional datasets, our method achieves over 12% improvement in unseen-domain matching mAP, significantly outperforming state-of-the-art approaches. It demonstrates strong cross-domain generalization and plug-and-play zero-shot deployability without domain-specific adaptation.

Technology Category

Application Category

📝 Abstract

The ideal goal of image matching is to achieve stable and efficient performance in unseen domains. However, many existing learning-based optical-SAR image matching methods, despite their effectiveness in specific scenarios, exhibit limited generalization and struggle to adapt to practical applications. Repeatedly training or fine-tuning matching models to address domain differences is not only not elegant enough but also introduces additional computational overhead and data production costs. In recent years, general foundation models have shown great potential for enhancing generalization. However, the disparity in visual domains between natural and remote sensing images poses challenges for their direct application. Therefore, effectively leveraging foundation models to improve the generalization of optical-SAR image matching remains challenge. To address the above challenges, we propose PromptMID, a novel approach that constructs modality-invariant descriptors using text prompts based on land use classification as priors information for optical and SAR image matching. PromptMID extracts multi-scale modality-invariant features by leveraging pre-trained diffusion models and visual foundation models (VFMs), while specially designed feature aggregation modules effectively fuse features across different granularities. Extensive experiments on optical-SAR image datasets from four diverse regions demonstrate that PromptMID outperforms state-of-the-art matching methods, achieving superior results in both seen and unseen domains and exhibiting strong cross-domain generalization capabilities. The source code will be made publicly available https://github.com/HanNieWHU/PromptMID.

Problem

Research questions and friction points this paper is trying to address.

Improves optical-SAR image matching generalization

Reduces domain adaptation computational overhead

Leverages foundation models for cross-domain efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses text prompts for descriptors

Leverages diffusion and visual models

Features multi-scale invariant aggregation

🔎 Similar Papers

Conditional Brownian Bridge Diffusion Model for VHR SAR to Optical Image Translation