AI Summary
Existing feature matching methods rely on scarce multi-view imagery, and their single-view 2D encoders model 3D correspondences poorly, resulting in weak cross-domain generalization. To address this, we propose Lift to Match (L2M), a two-stage framework that learns 3D-aware feature matching from single-view images without multi-view supervision, the first of its kind. L2M lifts 2D features into 3D space and models the resulting 3D feature field via differentiable Gaussian representations, enabling self-supervised matching learning through novel-view rendering. Trained solely on large-scale single-view image collections, L2M achieves state-of-the-art performance across multiple zero-shot matching benchmarks, significantly improves robustness in complex scenes, and enhances cross-domain generalization, demonstrating strong scalability and practical applicability in real-world vision tasks.
Abstract
Feature matching plays a fundamental role in many computer vision tasks, yet existing methods rely heavily on scarce and clean multi-view image collections, which constrains their generalization to diverse and challenging scenarios. Moreover, conventional feature encoders are typically trained on single-view 2D images, limiting their capacity to capture 3D-aware correspondences. In this paper, we propose a novel two-stage framework, named Lift to Match (L2M), that lifts 2D images into 3D space, taking full advantage of large-scale and diverse single-view images. Specifically, in the first stage, we learn a 3D-aware feature encoder using a combination of multi-view image synthesis and a 3D feature Gaussian representation, which injects 3D geometric knowledge into the encoder. In the second stage, a novel-view rendering strategy, combined with large-scale synthetic data generated from single-view images, is used to learn a feature decoder for robust feature matching, achieving generalization across diverse domains. Extensive experiments demonstrate that our method achieves superior generalization on zero-shot evaluation benchmarks, highlighting the effectiveness of the proposed framework for robust feature matching.
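The two-stage pipeline described in the abstract can be sketched at a high level. The sketch below is purely illustrative: the function names (`encode_2d`, `lift_to_3d`, `render_novel_view`), feature dimensions, and data flow are assumptions made for clarity, not the authors' actual implementation, and the 3D lifting and rendering steps are reduced to shape-level placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_2d(image):
    # Stage 1 component (placeholder): a 2D encoder producing
    # per-pixel features of shape (H, W, C).
    h, w, _ = image.shape
    return rng.standard_normal((h, w, 8))

def lift_to_3d(features):
    # Stage 1 (illustrative): lift per-pixel features into a set of
    # 3D feature Gaussians, here reduced to (position, feature) pairs.
    h, w, c = features.shape
    positions = rng.standard_normal((h * w, 3))
    return positions, features.reshape(h * w, c)

def render_novel_view(positions, feats, h, w):
    # Stage 2 (illustrative): render the 3D feature field from a novel
    # viewpoint; a real system would use differentiable Gaussian
    # rendering, here we only reshape to keep the sketch runnable.
    return feats[: h * w].reshape(h, w, -1)

# Single-view image -> 2D features -> 3D feature field -> novel view.
image = rng.standard_normal((4, 4, 3))
feats_2d = encode_2d(image)
pos, feats_3d = lift_to_3d(feats_2d)
novel_feats = render_novel_view(pos, feats_3d, 4, 4)
# A matching loss would then compare feats_2d against novel_feats,
# providing self-supervision without any multi-view imagery.
print(novel_feats.shape)  # (4, 4, 8)
```

The point of the sketch is the data flow: because the novel view is rendered from the lifted 3D field of a single image, correspondence supervision comes for free, which is what lets the method train on single-view collections alone.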