DistillMatch: Leveraging Knowledge Distillation from Vision Foundation Model for Multimodal Image Matching

📅 2025-09-19
📈 Citations: 0
Influential Citations: 0
🤖 AI Summary
Multimodal image matching suffers from large inter-modal appearance discrepancies and scarce annotated data, leading to low matching accuracy and poor generalization in existing methods. To address these challenges, we propose a lightweight cross-modal pixel-alignment framework based on knowledge distillation. Specifically, we leverage vision foundation models (e.g., DINOv2/v3) as teachers to transfer high-dimensional semantic knowledge to a compact student network. We further introduce a modality-category cross-injection mechanism to explicitly model cross-modal correlations. Additionally, a V2I-GAN is employed to synthesize pseudo-infrared images, enriching training data diversity. Extensive experiments on multiple public benchmarks demonstrate that our method significantly outperforms state-of-the-art approaches: it achieves substantial improvements in matching accuracy and exhibits superior generalization to unseen scenes and modalities.
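No code accompanies this summary; below is a minimal PyTorch sketch of the core distillation idea, assuming a frozen DINOv2 ViT-S/14 teacher loaded from torch.hub and an MSE objective between its patch tokens and a small student's feature map. The student architecture, the reshaping to a 2-D grid, and the loss choice are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Frozen VFM teacher: DINOv2 ViT-S/14 from torch.hub (embedding dim 384).
teacher = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
for p in teacher.parameters():
    p.requires_grad_(False)

class Student(nn.Module):
    """Hypothetical compact encoder; its features are aligned with teacher patch tokens."""
    def __init__(self, dim=384):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, dim, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

def distill_loss(student, images):
    """MSE between student features and frozen teacher patch tokens (assumed objective).

    images: (B, 3, H, W), ImageNet-normalized, H and W multiples of 14.
    """
    with torch.no_grad():
        tokens = teacher.forward_features(images)["x_norm_patchtokens"]  # (B, N, 384)
    B, N, C = tokens.shape
    side = int(N ** 0.5)                                   # square patch grid assumed
    target = tokens.transpose(1, 2).reshape(B, C, side, side)
    pred = student(images)                                 # (B, 384, H/4, W/4)
    pred = F.interpolate(pred, size=(side, side), mode="bilinear", align_corners=False)
    return F.mse_loss(pred, target)
```

For a 224×224 input the teacher yields a 16×16 token grid; a cosine-similarity or attention-map objective would be equally consistent with the description above.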

📝 Abstract
Multimodal image matching seeks pixel-level correspondences between images of different modalities, which is crucial for cross-modal perception, fusion, and analysis. However, the significant appearance differences between modalities make this task challenging. Due to the scarcity of high-quality annotated datasets, existing deep learning methods that extract modality-common features for matching perform poorly and lack adaptability to diverse scenarios. A Vision Foundation Model (VFM), trained on large-scale data, yields generalizable and robust feature representations suited to the data and tasks of various modalities, including multimodal matching. We therefore propose DistillMatch, a multimodal image matching method that uses knowledge distillation from a VFM. DistillMatch employs knowledge distillation to build a lightweight student model that extracts high-level semantic features from a VFM (including DINOv2 and DINOv3) to assist matching across modalities. To retain modality-specific information, it extracts modality category information and injects it into the other modality's features, which strengthens the model's understanding of cross-modal correlations. Furthermore, we design V2I-GAN, which translates visible images into pseudo-infrared images for data augmentation, to improve the model's generalization. Experiments show that DistillMatch outperforms existing algorithms on public datasets.
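V2I-GAN itself is not specified beyond its role in the abstract; the following hedged sketch shows how a visible-to-infrared generator could supply augmentation pairs. The stand-in generator architecture and the `make_pseudo_pair` helper are hypothetical, not the paper's design.

```python
import torch
import torch.nn as nn

class V2IGenerator(nn.Module):
    """Stand-in visible-to-infrared generator; the real V2I-GAN architecture
    is not described here, so a small encoder-decoder takes its place."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 7, padding=3), nn.Tanh(),     # single-channel pseudo-IR
        )

    def forward(self, x):
        return self.net(x)

@torch.no_grad()
def make_pseudo_pair(generator, vis_images):
    """Pair each visible image with a synthesized pseudo-infrared view.

    Because the generator preserves geometry, the pair has known (identity)
    pixel correspondences and can supervise cross-modal matching directly.
    """
    generator.eval()
    pseudo_ir = generator(vis_images)                      # (B, 1, H, W)
    return vis_images, pseudo_ir.repeat(1, 3, 1, 1)        # replicate to 3 channels
```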
Problem

Research questions and friction points this paper is trying to address.

Addressing multimodal image matching challenges due to appearance differences
Overcoming poor performance from scarce annotated datasets in cross-modal matching
Enhancing generalization and adaptability across diverse multimodal scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge distillation from a Vision Foundation Model
Injecting modality category information into the other modality's features (see the sketch after this list)
Data augmentation via V2I-GAN visible-to-infrared translation
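The injection mechanism is only described at a high level; one plausible reading is FiLM-style modulation of one modality's features by a learned embedding of the other modality's category, sketched below. The class name, the scale/shift form, and the two-modality setup are all assumptions.

```python
import torch
import torch.nn as nn

class ModalityCrossInjection(nn.Module):
    """Hypothetical cross-injection: a learned embedding for each modality
    (0 = visible, 1 = infrared) modulates the other modality's feature map,
    FiLM-style. The paper states only that modality category information is
    injected; the scale/shift form here is an assumption."""
    def __init__(self, dim, num_modalities=2):
        super().__init__()
        self.embed = nn.Embedding(num_modalities, dim)
        self.to_scale = nn.Linear(dim, dim)
        self.to_shift = nn.Linear(dim, dim)

    def forward(self, feat, other_modality):
        # feat: (B, C, H, W) features of one image;
        # other_modality: (B,) long tensor holding the *other* image's modality id.
        e = self.embed(other_modality)                     # (B, C)
        scale = self.to_scale(e)[:, :, None, None]         # broadcast over H, W
        shift = self.to_shift(e)[:, :, None, None]
        return feat * (1 + scale) + shift

# Usage: inject the infrared category (id 1) into visible-image features.
inject = ModalityCrossInjection(dim=256)
vis_feat = torch.randn(2, 256, 32, 32)
ir_id = torch.ones(2, dtype=torch.long)
vis_feat_injected = inject(vis_feat, ir_id)
```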
Meng Yang
Electronic Information School, Wuhan University
Fan Fan
Electronic Information School, Wuhan University
Zizhuo Li
Wuhan University
Computer Vision, Image Matching, Multi-View Geometry
Songchu Deng
Electronic Information School, Wuhan University
Yong Ma
Wuhan University
Infrared Image Processing, Remote Sensing
Jiayi Ma
Wuhan University
Computer Vision, Image Fusion, Image Matching