Mind the Gap: Aligning Vision Foundation Models to Image Feature Matching

📅 2025-07-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision foundation models suffer from a “comprehension–matching misalignment” in image feature matching: their single-image embeddings are inconsistent with the representations required for cross-image matching, and they lack fine-grained cross-image alignment mechanisms. To address this, we propose IMD—a framework built upon a pre-trained generative diffusion model. IMD introduces a cross-image interactive prompting module to enable bidirectional instance-level feature fusion and employs contrastive learning to optimize match-oriented representations. We are the first to systematically identify and characterize this misalignment phenomenon and introduce IMIM, the first benchmark dedicated to multi-instance matching. Experiments demonstrate that IMD achieves state-of-the-art performance on mainstream benchmarks and yields a 12% relative improvement on IMIM, significantly mitigating the misalignment—particularly excelling in complex, multi-instance scenarios.

📝 Abstract
Leveraging vision foundation models has emerged as a mainstream paradigm for improving image feature matching. However, previous works have overlooked the misalignment that arises when foundation models are introduced into feature matching. The misalignment stems from the discrepancy between foundation models, which focus on single-image understanding, and feature matching, which requires cross-image understanding. Specifically, 1) the embeddings derived from commonly used foundation models differ from the optimal embeddings required for feature matching; and 2) there is no effective mechanism for transferring single-image understanding ability to cross-image understanding. A significant consequence of the misalignment is that these methods struggle with multi-instance feature matching. To address this, we introduce a simple but effective framework called IMD (Image feature Matching with a pre-trained Diffusion model), which has two parts: 1) unlike the dominant solutions, which employ contrastive-learning-based foundation models that emphasize global semantics, we integrate generative diffusion models to effectively capture instance-level details; 2) we leverage the prompt mechanism of generative models as a natural tunnel and propose a novel cross-image interaction prompting module to facilitate bidirectional information interaction between image pairs. To measure the misalignment more accurately, we propose a new benchmark called IMIM, which focuses on multi-instance scenarios. IMD establishes a new state of the art on commonly evaluated benchmarks, and its 12% improvement on IMIM indicates that our method effectively mitigates the misalignment.
Problem

Research questions and friction points this paper is trying to address.

Misalignment between vision foundation models and feature matching requirements
Lack of cross-image understanding in single-image foundation models
Difficulty in addressing multi-instance feature matching problems
Innovation

Methods, ideas, or system contributions that make the work stand out.

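Integrates a pre-trained generative diffusion model to capture instance-level details
Uses a cross-image interaction prompting module for bidirectional information exchange between image pairs
Proposes the IMIM benchmark, focused on multi-instance scenarios, to measure the misalignment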
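
The bidirectional cross-image interaction listed above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract describes the module only as using the diffusion model's prompt mechanism, so plain scaled dot-product cross-attention is swapped in here as a stand-in, and the function names (`cross_attend`, `bidirectional_interaction`) and toy feature vectors are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Let each query vector attend over the other image's vectors.

    Assumption: scaled dot-product attention with a residual connection,
    used only as a stand-in for the paper's prompting module.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        # Weighted mix of the other image's features.
        mixed = [
            sum(w * k[d] for w, k in zip(weights, keys_values))
            for d in range(dim)
        ]
        # Residual: keep the original feature and add cross-image context.
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out


def bidirectional_interaction(feats_a, feats_b):
    """Condition A on B and B on A, both from the original features."""
    return cross_attend(feats_a, feats_b), cross_attend(feats_b, feats_a)


# Toy features: 2 vectors for image A, 3 for image B, each of dimension 2.
feats_a = [[1.0, 0.0], [0.0, 1.0]]
feats_b = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
new_a, new_b = bidirectional_interaction(feats_a, feats_b)
```

Both directions read from the original features rather than from each other's updated output, so the exchange stays symmetric. Whether IMD applies its interaction this way is not stated in the abstract.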
Yuhan Liu
Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Jingwen Fu
Xi'an Jiaotong University
Computer Vision, Machine Learning
Yang Wu
Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Kangyi Wu
Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Pengna Li
Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Jiayi Wu
Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Sanping Zhou
Xi'an Jiaotong University
Computer Vision, Machine Learning
Jingmin Xin
Xi'an Jiaotong University
Statistical and Array Sensor Array, Pattern Recognition