🤖 AI Summary
This work addresses the limited generalization of supervised image matching models to unseen domains by proposing a zero-shot image matching approach that uses generic visual features extracted from pretrained DINO without any fine-tuning. The method pairs a many-to-many matching paradigm with a Harmonic Consensus Maximization (HCM) mechanism. HCM is motivated by a likelihood-based perspective that interprets the existing robust estimation technique as a zeroth-order approximation, and it yields a faster and finer-grained robust mechanism. The proposed framework achieves performance on out-of-distribution data comparable to specialized supervised models and significantly outperforms existing zero-shot methods in camera pose estimation tasks.
📝 Abstract
Motivated by the limited generalization of supervised image matching models to unseen image domains, we explore the zero-shot deployment of DINO features for this task. The generalist visual representation extracted from DINO is inherently ambiguous when used to match feature points among semantically similar instances, which prompts us to adopt a many-to-many (m-to-m) matching paradigm. However, the existing robust mechanism under m-to-m data association is computationally heavy, because it requires finding a maximum-cardinality matching in the inlier association graph for each parameter evaluation. To address this inefficiency, we introduce a novel likelihood perspective, which interprets the existing method as a zeroth-order approximation of an otherwise intractable likelihood calculation and inspires us to propose a faster and finer-grained robust mechanism, termed Harmonic Consensus Maximization (HCM). Taking camera pose estimation as an example downstream task, we demonstrate that general-purpose visual features, used out of the box without any adaptation, can compete with specialized matching models on out-of-distribution datasets when paired with m-to-m association and the HCM mechanism.
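To make the cost of the existing robust mechanism concrete, here is a minimal sketch of consensus scoring under m-to-m association. It is an illustration under stated assumptions, not the authors' implementation: the model is a toy 2D translation rather than a camera pose, the similarity matrix is made up rather than computed from DINO features, and all function names (`candidate_pairs`, `inlier_graph`, `max_cardinality_matching`, `consensus`) are hypothetical. It does not implement HCM, whose formulation is not given in the abstract.

```python
from math import hypot

def candidate_pairs(sim, k):
    """Many-to-many candidates: each source point keeps its k most similar targets."""
    pairs = []
    for i, row in enumerate(sim):
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        pairs.extend((i, j) for j in top)
    return pairs

def inlier_graph(src, dst, pairs, shift, tol):
    """Bipartite adjacency list of candidate pairs consistent with `shift` within `tol`."""
    graph = {}
    for i, j in pairs:
        dx = src[i][0] + shift[0] - dst[j][0]
        dy = src[i][1] + shift[1] - dst[j][1]
        if hypot(dx, dy) <= tol:
            graph.setdefault(i, []).append(j)
    return graph

def max_cardinality_matching(graph):
    """Size of a maximum bipartite matching, found with augmenting paths."""
    match_of_dst = {}  # target index -> source index currently matched to it

    def augment(i, seen):
        for j in graph.get(i, []):
            if j in seen:
                continue
            seen.add(j)
            # Take a free target, or re-route the source that currently holds it.
            if j not in match_of_dst or augment(match_of_dst[j], seen):
                match_of_dst[j] = i
                return True
        return False

    return sum(augment(i, set()) for i in graph)

def consensus(src, dst, pairs, shift, tol=0.1):
    """Consensus of one hypothesis: one-to-one inliers among the m-to-m candidates."""
    return max_cardinality_matching(inlier_graph(src, dst, pairs, shift, tol))

# Toy data (assumed): dst is src translated by (2, 1); sim is deliberately ambiguous.
src = [(0, 0), (1, 0), (0, 1)]
dst = [(2, 1), (3, 1), (2, 2)]
sim = [[0.90, 0.80, 0.10],
       [0.85, 0.90, 0.20],
       [0.10, 0.30, 0.90]]
pairs = candidate_pairs(sim, k=2)

good = consensus(src, dst, pairs, shift=(2, 1))   # correct translation
bad = consensus(src, dst, pairs, shift=(3, 1))    # wrong translation
loose_graph = inlier_graph(src, dst, pairs, (2, 1), tol=1.0)
loose_pairs = sum(len(targets) for targets in loose_graph.values())
loose = max_cardinality_matching(loose_graph)
```

With a loose tolerance, several candidate pairs that share a point pass the residual test, so counting inlier pairs directly would overcount; the matching step restores a one-to-one count. The price is that a full graph-matching problem is solved for every hypothesis evaluated, which is the inefficiency the abstract describes.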