🤖 AI Summary
This work addresses the trade-off between efficiency and accuracy in local feature matching. We propose an efficient Transformer architecture built on two components: multiple homography hypotheses that approximate the continuous correspondence between images, and a lightweight unidirectional cross-attention module that reduces the computational cost of refinement. The resulting framework enables end-to-end learnable dense matching with both high accuracy and substantially faster inference. On YFCC100M, our method reaches matching accuracy competitive with LoFTR at four times the inference speed; further evaluations on MegaDepth, ScanNet, and HPatches support its robustness and generalization. Our core contribution is the combined design of multi-homography modeling and unidirectional cross-attention for efficient, accurate local feature matching.
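To make the cost argument concrete, the following is a minimal sketch of what "unidirectional" cross-attention can mean: features of one image are updated by attending to the other image, while the other image's features are left untouched, so only one attention pass is computed instead of two. The function names, the single attention head, the absence of learned projections, and the residual update are all simplifying assumptions for illustration; the summary does not specify the actual module.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def unidirectional_update(feats_a, feats_b):
    """Hypothetical helper: update only feats_a from feats_b.

    A bidirectional layer would also update feats_b from feats_a;
    here feats_b is returned unchanged, which halves the attention work.
    """
    attended = cross_attend(feats_a, feats_b, feats_b)
    new_a = [[a + t for a, t in zip(fa, ta)] for fa, ta in zip(feats_a, attended)]
    return new_a, feats_b


# Toy features: one token in image A, two tokens in image B.
feats_a = [[1.0, 0.0]]
feats_b = [[1.0, 0.0], [0.0, 1.0]]
new_a, same_b = unidirectional_update(feats_a, feats_b)
```

In this toy run the single token of image A moves toward the image B token it resembles most, and `same_b` is the very same object as `feats_b`.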
📝 Abstract
We tackle the efficiency problem of learning local feature matching. Recent advancements have given rise to purely CNN-based and transformer-based approaches, each augmented with deep learning techniques. While CNN-based methods often excel in matching speed, transformer-based methods tend to provide more accurate matches. We propose an efficient transformer-based network architecture for local feature matching. The technique builds on two ideas: constructing multiple homography hypotheses to approximate the continuous correspondence in the real world, and using uni-directional cross-attention to accelerate the refinement. On the YFCC100M dataset, our matching accuracy is competitive with LoFTR, a state-of-the-art transformer-based architecture, while inference is 4 times as fast, even outperforming the CNN-based methods in speed. Comprehensive evaluations on other open datasets such as MegaDepth, ScanNet, and HPatches demonstrate our method's efficacy, highlighting its potential to enhance a wide array of downstream applications.
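As an illustration of how several homographies can approximate a continuous correspondence, the sketch below assigns one homography hypothesis to each image patch and warps a point with the hypothesis of the patch that contains it, giving a piecewise-planar mapping. The per-patch assignment, the patch size, and the names `apply_homography` and `warp_with_patch_hypotheses` are assumptions made for this example; the abstract does not describe how the hypotheses are organized or predicted.

```python
def apply_homography(H, x, y):
    """Map (x, y) through a 3x3 homography H given as nested lists."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    return u / w, v / w


def warp_with_patch_hypotheses(point, patch_size, hypotheses):
    """Warp a point with the hypothesis of the patch that contains it.

    hypotheses maps (patch_row, patch_col) -> 3x3 homography
    (a hypothetical layout chosen for this sketch).
    """
    x, y = point
    key = (int(y // patch_size), int(x // patch_size))
    return apply_homography(hypotheses[key], x, y)


# Two neighbouring 8x8 patches, each with its own toy hypothesis.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
SHIFT_X = [[1.0, 0.0, 5.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
hypotheses = {(0, 0): IDENTITY, (0, 1): SHIFT_X}

left = warp_with_patch_hypotheses((3.0, 4.0), 8, hypotheses)    # patch (0, 0)
right = warp_with_patch_hypotheses((10.0, 4.0), 8, hypotheses)  # patch (0, 1)
```

Here the point in the left patch stays in place while the point in the right patch is shifted by 5 pixels along x, showing how different regions can follow different planar motions.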