🤖 AI Summary
Cross-view localization remains challenging because establishing fine-grained, pixel-level correspondences between ground-level and aerial imagery is difficult, which limits both localization accuracy and interpretability. To address this, we propose an end-to-end cross-view fine-grained matching framework comprising (i) a surface-aware bird's-eye-view (BEV) projection model that encodes geometric priors, (ii) a SimRefiner module that refines similarity matrices via iterative affinity propagation, and (iii) a local-global residual correction mechanism for precise correspondence refinement. We further introduce CVFM, the first benchmark with pixel-level ground-truth annotations, enabling direct, RANSAC-free matching. Our method substantially improves robustness and accuracy under extreme viewpoint disparities, achieving state-of-the-art performance across multiple benchmarks. This work establishes a new paradigm for cross-view localization that is both highly accurate and inherently interpretable.
📝 Abstract
Cross-view localization aims to estimate the 3-DoF (degrees-of-freedom) pose of a ground-view image by registering it to aerial or satellite imagery, and it is essential in GNSS-denied environments such as urban canyons and disaster zones. Existing methods either regress poses directly or align features in a shared bird's-eye-view (BEV) space, both of which depend on accurate spatial correspondences between perspectives. However, these methods fail to establish strict cross-view correspondences, yielding only coarse or geometrically inconsistent matches. Consequently, fine-grained image matching between ground and aerial views remains an unsolved problem, which in turn limits the interpretability of localization results. In this paper, we revisit cross-view localization from the perspective of cross-view image matching and propose a novel framework that improves both matching and localization. Specifically, we introduce a Surface Model that captures visible regions for accurate BEV projection, and a SimRefiner module that refines the similarity matrix through local-global residual correction, eliminating reliance on post-processing such as RANSAC. To further support research in this area, we introduce CVFM, the first benchmark with 32,509 cross-view image pairs annotated with pixel-level correspondences. Extensive experiments demonstrate that our approach substantially improves both localization accuracy and image-matching quality, setting new baselines under extreme viewpoint disparity.
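To make the similarity-refinement idea concrete, here is a minimal toy sketch of refining a ground-to-aerial similarity matrix by propagating local affinities and adding the propagated term back as a residual correction. This is an illustrative assumption, not the paper's actual SimRefiner: the function name, update rule, and normalization below are all invented for demonstration.

```python
import numpy as np

def refine_similarity(S, A_g, A_a, alpha=0.5, iters=3):
    """Toy affinity-propagation refinement (illustrative, not SimRefiner).

    S   : (N_g, N_a) initial ground-to-aerial similarity scores
    A_g : (N_g, N_g) row-normalized affinity among ground features
    A_a : (N_a, N_a) row-normalized affinity among aerial features
    Each iteration smooths S through both within-view affinity graphs
    and adds the smoothed term back as a residual correction.
    """
    for _ in range(iters):
        residual = A_g @ S @ A_a.T      # propagate scores through both views
        S = S + alpha * residual        # residual correction step
        S = S / np.abs(S).max()         # keep scores bounded
    return S

# Pixel-level matches then come directly from the refined matrix,
# e.g. per-row argmax, with no RANSAC-style outlier filtering.
```

The design intuition is that neighboring pixels within each view should support each other's matches, so smoothing the similarity matrix through both within-view affinity graphs suppresses isolated, geometrically inconsistent scores before taking the per-row argmax.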