🤖 AI Summary
This work addresses fine-grained cross-view localization between ground-level and aerial imagery, aiming to estimate the 3-degrees-of-freedom (3DoF) pose of a ground image within a georeferenced aerial image. To this end, a height-aware bird's-eye-view (BEV) representation is generated: ground-image features are lifted to a 3D point cloud, and a learned feature selection along the height dimension pools the points into an interpretable, geometry-grounded BEV plane. A weakly supervised, pose-guided point-correspondence learning paradigm then samples sparse point pairs and recovers the relative pose via Procrustes alignment, enforcing geometrically and semantically consistent matching. The core contributions are a height-aware feature pooling strategy for BEV generation, which makes each BEV feature traceable to the ground image, and a point-correspondence learning framework supervised only weakly by the camera pose. On the cross-area test set of the VIGOR benchmark, the method reduces mean localization error by 28% over the previous state of the art while improving cross-view semantic alignment and generalization across geographic regions.
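The height-aware pooling described above can be sketched as a softmax selection over the height axis of a lifted feature volume. This is a minimal NumPy illustration of the idea, not the paper's implementation; the shapes, the function name `height_aware_pool`, and the use of a plain softmax are our assumptions.

```python
import numpy as np

def height_aware_pool(feat_volume, height_scores):
    """Pool a 3D feature volume to a BEV plane by softmax selection
    along the height axis (a sketch of the idea, not the exact method).

    feat_volume:   (C, Z, X, Y) features lifted from the ground image
    height_scores: (Z, X, Y) logits scoring each height cell
    """
    # Softmax over the height (Z) axis gives a selection weight per cell.
    w = np.exp(height_scores - height_scores.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)             # (Z, X, Y)
    # Weighted sum collapses Z, yielding a (C, X, Y) BEV feature map.
    bev = (feat_volume * w[None]).sum(axis=1)
    return bev, w  # w records which height cell (hence which ground-image
                   # feature) contributed to each BEV location

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 4, 5, 5))    # C=8 channels, Z=4, X=Y=5
scores = rng.normal(size=(4, 5, 5))
bev, w = height_aware_pool(feats, scores)
print(bev.shape)  # (8, 5, 5)
```

Because the weights `w` are explicit, the pooling is traceable: for each BEV cell one can read off which height slice, and thus which ground-image feature, dominated the selection.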
📝 Abstract
We propose a novel fine-grained cross-view localization method that estimates the 3-degrees-of-freedom (3DoF) pose of a ground-level image in an aerial image of the surroundings by matching fine-grained features between the two images. The pose is estimated by aligning a point plane generated from the ground image with a point plane sampled from the aerial image. To generate the ground points, we first map ground image features to a 3D point cloud. Our method then learns to select features along the height dimension to pool the 3D points onto a Bird's-Eye-View (BEV) plane. This selection enables us to trace which feature in the ground image contributes to the BEV representation. Next, we sample a set of sparse matches from computed point correspondences between the two point planes and compute their relative pose using Procrustes alignment. Compared to the previous state of the art, our method reduces the mean localization error by 28% on the VIGOR cross-area test set. Qualitative results show that our method learns semantically consistent matches across ground and aerial views through weakly supervised learning from the camera pose.
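The final step, recovering a relative pose from sparse point correspondences, is the classical (weighted) Procrustes / Kabsch problem: find the rotation R and translation t minimizing the weighted sum of squared residuals between matched 2D points. Below is a self-contained sketch of that solver under our own naming (`procrustes_2d`); the paper's actual formulation may differ in weighting and differentiability details.

```python
import numpy as np

def procrustes_2d(src, dst, weights=None):
    """Rigid 2D alignment minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2.

    src, dst: (N, 2) matched point planes; weights: (N,) match confidences.
    Returns the proper rotation R (2x2) and translation t (2,).
    """
    if weights is None:
        weights = np.ones(len(src))
    w = weights / weights.sum()
    mu_s = (w[:, None] * src).sum(axis=0)   # weighted centroids
    mu_d = (w[:, None] * dst).sum(axis=0)
    # Weighted cross-covariance of the centered point sets.
    H = (src - mu_s).T @ (w[:, None] * (dst - mu_d))
    U, _, Vt = np.linalg.svd(H)
    # Fix possible reflection so det(R) = +1 (a proper rotation).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t

# Sanity check: recover a known 30-degree rotation and a translation.
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
t_true = np.array([2.0, -1.0])
pts = np.random.default_rng(1).normal(size=(20, 2))
R, t = procrustes_2d(pts, pts @ R_true.T + t_true)
print(np.allclose(R, R_true), np.allclose(t, t_true))  # True True
```

The closed-form SVD solution makes this step differentiable almost everywhere, which is what allows the pose loss to supervise the upstream correspondences weakly, without ground-truth matches.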