🤖 AI Summary
To address weak generalization and high domain adaptation costs in optical-SAR cross-modal image matching, this paper proposes a fine-tuning-free zero-shot matching method leveraging land-use semantic priors. We design a text-prompt-guided, modality-invariant descriptor construction framework. To our knowledge, this is the first work to jointly exploit diffusion models (Stable Diffusion) and vision foundation models (ViT/CLIP) for cross-modal feature alignment. Interpretable semantic text prompts drive modality-agnostic representation learning, while a multi-granularity feature aggregation module enhances cross-domain robustness. Evaluated on four heterogeneous regional datasets, our method achieves over 12% improvement in unseen-domain matching mAP, significantly outperforming state-of-the-art approaches. It demonstrates strong cross-domain generalization and plug-and-play zero-shot deployability without domain-specific adaptation.
📝 Abstract
The ideal goal of image matching is to achieve stable and efficient performance in unseen domains. However, many existing learning-based optical-SAR image matching methods, despite their effectiveness in specific scenarios, exhibit limited generalization and struggle to adapt to practical applications. Repeatedly training or fine-tuning matching models to address domain differences is not only not elegant enough but also introduces additional computational overhead and data production costs. In recent years, general foundation models have shown great potential for enhancing generalization. However, the disparity in visual domains between natural and remote sensing images poses challenges for their direct application. Therefore, effectively leveraging foundation models to improve the generalization of optical-SAR image matching remains challenge. To address the above challenges, we propose PromptMID, a novel approach that constructs modality-invariant descriptors using text prompts based on land use classification as priors information for optical and SAR image matching. PromptMID extracts multi-scale modality-invariant features by leveraging pre-trained diffusion models and visual foundation models (VFMs), while specially designed feature aggregation modules effectively fuse features across different granularities. Extensive experiments on optical-SAR image datasets from four diverse regions demonstrate that PromptMID outperforms state-of-the-art matching methods, achieving superior results in both seen and unseen domains and exhibiting strong cross-domain generalization capabilities. The source code will be made publicly available https://github.com/HanNieWHU/PromptMID.