🤖 AI Summary
Multimodal image matching suffers from large inter-modal appearance discrepancies and scarce annotated data, leading to low matching accuracy and poor generalization in existing methods. To address these challenges, we propose DistillMatch, a lightweight cross-modal pixel-alignment framework based on knowledge distillation. Specifically, we leverage vision foundation models (DINOv2 and DINOv3) as teachers to transfer high-dimensional semantic knowledge to a compact student network. We further introduce a modality-category cross-injection mechanism to explicitly model cross-modal correlations. Additionally, V2I-GAN synthesizes pseudo-infrared images from visible-light images, enriching training data diversity. Experiments on multiple public benchmarks demonstrate that our method outperforms state-of-the-art approaches in matching accuracy and generalizes better to unseen scenes and modalities.
📝 Abstract
Multimodal image matching seeks pixel-level correspondences between images of different modalities, which is crucial for cross-modal perception, fusion, and analysis. However, the significant appearance differences between modalities make this task challenging. Due to the scarcity of high-quality annotated datasets, existing deep learning methods that extract modality-common features for matching perform poorly and lack adaptability to diverse scenarios. Vision foundation models (VFMs), trained on large-scale data, yield generalizable and robust feature representations that transfer to data and tasks of various modalities, including multimodal matching. Thus, we propose DistillMatch, a multimodal image matching method that uses knowledge distillation from VFMs. DistillMatch employs knowledge distillation to build a lightweight student model that extracts high-level semantic features from VFMs (DINOv2 and DINOv3) to assist matching across modalities. To retain modality-specific information, it extracts each modality's category information and injects it into the other modality's features, which enhances the model's understanding of cross-modal correlations. Furthermore, we design V2I-GAN, which translates visible-light images into pseudo-infrared images for data augmentation, boosting the model's generalization. Experiments show that DistillMatch outperforms existing algorithms on public datasets.
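To make the feature-distillation idea concrete, here is a minimal, dependency-free sketch of the kind of objective such a student network might be trained with: a mean (1 − cosine similarity) loss that pulls per-pixel (or per-patch) student features toward the frozen teacher's features. All names are illustrative; this is not the paper's actual implementation, which distills from DINOv2/DINOv3 inside a full matching pipeline.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def distill_loss(teacher_feats, student_feats):
    """Mean (1 - cosine similarity) over matched teacher/student
    feature vectors, e.g. one vector per pixel or patch.

    The loss is 0 when the student's features point in the same
    direction as the teacher's, and grows as they diverge.
    """
    losses = [1.0 - cosine(t, s)
              for t, s in zip(teacher_feats, student_feats)]
    return sum(losses) / len(losses)
```

For example, identical teacher and student features give a loss of 0, while orthogonal features give a loss of 1; a training loop would backpropagate this loss through the student only, keeping the VFM teacher frozen.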