FG$^2$: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching

📅 2025-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses fine-grained cross-view localization between ground-level and aerial imagery, estimating the 3-degree-of-freedom (3DoF) pose of a ground image within a georeferenced aerial view. To this end, it proposes a height-aware bird's-eye-view (BEV) representation: ground-image features are lifted to a 3D point cloud, and a learned selection along the height dimension pools the points to an interpretable, geometry-grounded BEV plane. The method then learns point correspondences between the ground BEV and the aerial image under weak supervision from the camera pose alone, sampling sparse point matches and recovering the relative pose with Procrustes alignment. Core contributions include the height-aware feature-selection strategy for BEV generation and the weakly supervised correspondence-learning framework. On the VIGOR cross-area test set, the method reduces mean localization error by 28% over the previous state of the art while producing semantically consistent matches across ground and aerial views.

📝 Abstract
We propose a novel fine-grained cross-view localization method that estimates the 3 Degrees of Freedom pose of a ground-level image in an aerial image of the surroundings by matching fine-grained features between the two images. The pose is estimated by aligning a point plane generated from the ground image with a point plane sampled from the aerial image. To generate the ground points, we first map ground image features to a 3D point cloud. Our method then learns to select features along the height dimension to pool the 3D points to a Bird's-Eye-View (BEV) plane. This selection enables us to trace which feature in the ground image contributes to the BEV representation. Next, we sample a set of sparse matches from computed point correspondences between the two point planes and compute their relative pose using Procrustes alignment. Compared to the previous state-of-the-art, our method reduces the mean localization error by 28% on the VIGOR cross-area test set. Qualitative results show that our method learns semantically consistent matches across ground and aerial views through weakly supervised learning from the camera pose.
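The abstract's final step, recovering the relative pose of two matched point sets via Procrustes alignment, can be sketched in a few lines of NumPy. This is a generic 2D Kabsch/Procrustes solver under assumed shapes, not the paper's implementation; function and variable names are illustrative.

```python
import numpy as np

def procrustes_2d(src, dst):
    """Least-squares rigid 2D transform (R, t) mapping src onto dst.

    src, dst: (N, 2) arrays of matched point coordinates
    (e.g. ground-BEV points and sampled aerial points).
    """
    # Center both point sets.
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    S, D = src - mu_s, dst - mu_d
    # SVD of the cross-covariance gives the optimal rotation (Kabsch).
    U, _, Vt = np.linalg.svd(S.T @ D)
    # Reflection guard: force det(R) = +1 so the result is a pure rotation.
    sign = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, sign]) @ U.T
    t = mu_d - R @ mu_s
    return R, t
```

Given sparse matches sampled from the computed point correspondences, the recovered rotation and translation together constitute the 3DoF pose estimate.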
Problem

Research questions and friction points this paper is trying to address.

How to estimate a fine-grained 3DoF pose via feature matching rather than coarse retrieval
How to align ground-level and aerial images that differ drastically in viewpoint, using a BEV representation
How to learn cross-view correspondences with only weak supervision from the camera pose
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained feature matching for cross-view localization
3D point cloud to BEV plane feature pooling
Procrustes alignment for sparse match pose estimation
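The "3D point cloud to BEV plane feature pooling" contribution can be illustrated with a minimal sketch: a softmax-weighted selection along the height axis collapses a 3D feature volume to a BEV plane, and the weights expose which heights (and hence which ground-image features) contributed to each BEV cell. Shapes and names here are assumptions, not the paper's architecture.

```python
import numpy as np

def height_select_pool(feats, height_logits):
    """Pool a 3D feature volume to a BEV plane by learned height selection.

    feats:         (H, X, Y, C) features lifted to a 3D grid
    height_logits: (H, X, Y)    per-cell selection scores over height
    Returns the (X, Y, C) BEV features and the (H, X, Y) selection
    weights, which trace each BEV cell back to its source heights.
    """
    # Numerically stable softmax over the height dimension.
    w = np.exp(height_logits - height_logits.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)
    # Weighted sum over height collapses the volume to a BEV plane.
    bev = (w[..., None] * feats).sum(axis=0)
    return bev, w
```

Because the selection weights are explicit, a sharp weight at one height means the corresponding BEV feature is directly traceable to a single location in the ground image, which is what makes the representation interpretable.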