Learning Dense Feature Matching via Lifting Single 2D Image to 3D Space

๐Ÿ“… 2025-06-30
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing feature matching methods rely on scarce multi-view imagery and suffer from limited 3D correspondence modeling capability when using single-view 2D encoders, resulting in poor cross-domain generalization. To address this, we propose Lift to Match (L2M), a two-stage framework that learns 3D-aware feature matching from single-view images without multi-view supervisionโ€”the first of its kind. L2M lifts 2D features into 3D space and models the 3D feature field via differentiable Gaussian representations, enabling self-supervised matching learning through novel-view rendering. Trained solely on large-scale single-view image collections, L2M achieves state-of-the-art performance across multiple zero-shot matching benchmarks. It significantly improves robustness in complex scenes and enhances cross-domain generalization, demonstrating strong scalability and practical applicability in real-world vision tasks.

Technology Category

Application Category

๐Ÿ“ Abstract
Feature matching plays a fundamental role in many computer vision tasks, yet existing methods heavily rely on scarce and clean multi-view image collections, which constrains their generalization to diverse and challenging scenarios. Moreover, conventional feature encoders are typically trained on single-view 2D images, limiting their capacity to capture 3D-aware correspondences. In this paper, we propose a novel two-stage framework that lifts 2D images to 3D space, named as extbf{Lift to Match (L2M)}, taking full advantage of large-scale and diverse single-view images. To be specific, in the first stage, we learn a 3D-aware feature encoder using a combination of multi-view image synthesis and 3D feature Gaussian representation, which injects 3D geometry knowledge into the encoder. In the second stage, a novel-view rendering strategy, combined with large-scale synthetic data generation from single-view images, is employed to learn a feature decoder for robust feature matching, thus achieving generalization across diverse domains. Extensive experiments demonstrate that our method achieves superior generalization across zero-shot evaluation benchmarks, highlighting the effectiveness of the proposed framework for robust feature matching.
Problem

Research questions and friction points this paper is trying to address.

Overcoming reliance on scarce multi-view images for feature matching
Enhancing 2D feature encoders to capture 3D-aware correspondences
Achieving robust feature matching across diverse domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lifts 2D images to 3D space
Uses multi-view synthesis and 3D features
Novel-view rendering with synthetic data
๐Ÿ”Ž Similar Papers
No similar papers found.