🤖 AI Summary
Cross-view localization remains challenging because establishing fine-grained, pixel-level correspondences between ground-level and aerial imagery is difficult, which limits both localization accuracy and interpretability. To address this, we propose an end-to-end cross-view fine-grained matching framework comprising (i) a surface-aware bird's-eye-view (BEV) projection model that encodes geometric priors, (ii) a SimRefiner module that refines similarity matrices via iterative affinity propagation, and (iii) a local-global residual correction mechanism for precise correspondence refinement. We further introduce CVFM, the first benchmark with pixel-level ground-truth annotations, enabling direct, RANSAC-free matching. Our method substantially improves robustness and accuracy under extreme viewpoint disparities, achieving state-of-the-art performance across multiple benchmarks. This work establishes a new paradigm for cross-view localization that is both highly accurate and inherently interpretable.
📝 Abstract
Cross-view localization aims to estimate the 3-DoF (degrees-of-freedom) pose of a ground-view image by registering it to aerial or satellite imagery, and it is essential in GNSS-denied environments such as urban canyons and disaster zones. Existing methods either regress poses directly or align features in a shared bird's-eye-view (BEV) space, both of which depend on accurate spatial correspondences between perspectives. However, these methods fail to establish strict cross-view correspondences, yielding only coarse or geometrically inconsistent matches. Consequently, fine-grained image matching between ground and aerial views remains an unsolved problem, which in turn limits the interpretability of localization results. In this paper, we revisit cross-view localization from the perspective of cross-view image matching and propose a novel framework that improves both matching and localization. Specifically, we introduce a Surface Model that captures visible regions for accurate BEV projection, and a SimRefiner module that refines the similarity matrix through local-global residual correction, eliminating reliance on post-processing such as RANSAC. To further support research in this area, we introduce CVFM, the first benchmark with 32,509 cross-view image pairs annotated with pixel-level correspondences. Extensive experiments demonstrate that our approach substantially improves both localization accuracy and image-matching quality, setting new baselines under extreme viewpoint disparity.
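To make the similarity-refinement idea concrete, here is a minimal toy sketch of refining a ground-to-aerial similarity matrix by propagating local affinities and adding the propagated term back as a residual correction. This is an illustrative assumption, not the paper's actual SimRefiner: the function name, update rule, and normalization below are all invented for demonstration.

```python
import numpy as np

def refine_similarity(S, A_g, A_a, alpha=0.5, iters=3):
    """Toy affinity-propagation refinement (illustrative, not SimRefiner).

    S   : (N_g, N_a) initial ground-to-aerial similarity scores
    A_g : (N_g, N_g) row-normalized affinity among ground features
    A_a : (N_a, N_a) row-normalized affinity among aerial features
    Each iteration smooths S through both within-view affinity graphs
    and adds the smoothed term back as a residual correction.
    """
    for _ in range(iters):
        residual = A_g @ S @ A_a.T      # propagate scores through both views
        S = S + alpha * residual        # residual correction step
        S = S / np.abs(S).max()         # keep scores bounded
    return S

# Pixel-level matches then come directly from the refined matrix,
# e.g. per-row argmax, with no RANSAC-style outlier filtering.
```

The design intuition is that neighboring pixels within each view should support each other's matches, so smoothing the similarity matrix through both within-view affinity graphs suppresses isolated, geometrically inconsistent scores before taking the per-row argmax.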