Fine-Grained Cross-View Localization via Local Feature Matching and Monocular Depth Priors

📅 2025-09-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the registration inaccuracies and information loss that perspective distortion and severe height compression cause in cross-view localization, this paper proposes a fine-grained cross-view localization method that requires no per-model fine-tuning. The approach combines local feature matching between ground-level and aerial images with monocular depth priors, directly lifting the matched keypoints into bird's-eye view (BEV) space. A scale-aware Procrustes alignment then estimates the 3-degree-of-freedom camera pose from the correspondences. The method accepts either metric or relative depth inputs and learns cross-view correspondences with only weak supervision on camera pose. Experiments demonstrate significant improvements in localization accuracy and robustness under challenging conditions, including cross-area generalization and unknown camera orientation, while maintaining high interpretability and practical deployability.

📝 Abstract
We propose an accurate and highly interpretable fine-grained cross-view localization method that estimates the 3 Degrees of Freedom pose of a ground-level image by matching its local features with a reference aerial image. Previous methods typically transform the ground image into a bird's-eye view (BEV) representation and then align it with the aerial image for localization. However, this transformation often leads to information loss due to perspective distortion or compression of height information, thereby degrading alignment quality with the aerial view. In contrast, our method directly establishes correspondences between ground and aerial images and lifts only the matched keypoints to BEV space using monocular depth prior. Notably, modern depth predictors can provide reliable metric depth when the test samples are similar to the training data. When the depth distribution differs, they still produce consistent relative depth, i.e., depth accurate up to an unknown scale. Our method supports both metric and relative depth. It employs a scale-aware Procrustes alignment to estimate the camera pose from the correspondences and optionally recover the scale when using relative depth. Experimental results demonstrate that, with only weak supervision on camera pose, our method learns accurate local feature correspondences and achieves superior localization performance under challenging conditions, such as cross-area generalization and unknown orientation. Moreover, our method is compatible with various relative depth models without requiring per-model finetuning. This flexibility, combined with strong localization performance, makes it well-suited for real-world deployment.
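The abstract's key geometric step is lifting matched ground-image keypoints into BEV space with a monocular depth prior. The sketch below illustrates the standard pinhole back-projection behind that idea; the function name, argument layout, and ground-plane convention (x-right, z-forward, height dropped) are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def lift_keypoints_to_bev(keypoints_uv, depth, K):
    """Back-project pixel keypoints to camera-frame BEV coordinates
    using per-keypoint depth and pinhole intrinsics K (3x3).

    Illustrative sketch only: with relative depth, the resulting BEV
    points are correct up to the unknown scale that the paper's
    scale-aware alignment later recovers.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    u = keypoints_uv[:, 0]
    x = (u - cx) / fx * depth        # lateral offset (metres if depth is metric)
    z = depth                        # forward distance along the optical axis
    # Dropping the height coordinate projects the points onto the ground plane.
    return np.stack([x, z], axis=1)  # (N, 2) BEV coordinates

# Toy example: a keypoint at the principal point lands straight ahead.
pts_uv = np.array([[320.0, 240.0], [400.0, 260.0]])
depths = np.array([5.0, 10.0])
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
bev = lift_keypoints_to_bev(pts_uv, depths, K)
```

Because only the matched keypoints are lifted, no dense BEV image is synthesized, which is how the method sidesteps the distortion and height-compression losses of full-image view transformation.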
Problem

Research questions and friction points this paper is trying to address.

Estimating ground image pose via local feature matching with aerial imagery
Addressing information loss from perspective distortion in cross-view localization
Supporting both metric and relative depth models for flexible deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct local feature matching between ground and aerial images
Lifting matched keypoints using monocular depth priors
Scale-aware Procrustes alignment for camera pose estimation
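The scale-aware Procrustes alignment in the last bullet can be sketched with the closed-form Umeyama solution for a 2-D similarity transform: it recovers the yaw rotation and 2-D translation (the 3-DoF pose) and, when the input depth is only relative, the unknown scale as well. This is a minimal unweighted sketch, not the paper's implementation, which learns correspondence confidences.

```python
import numpy as np

def scale_aware_procrustes(src, dst):
    """Closed-form 2-D similarity alignment (Umeyama) between matched
    point sets: finds scale s, rotation R, translation t minimising
    ||s * R @ src_i + t - dst_i||^2 over all correspondences.

    Sketch under the assumption of equally weighted, outlier-free matches.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)          # 2x2 cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))        # guard against reflections
    D = np.diag([1.0, d])
    R = U @ D @ Vt                            # yaw rotation
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_src    # scale (recovers relative-depth scale)
    t = mu_d - s * R @ mu_s
    return s, R, t

# Toy check: recover a known transform from exact correspondences.
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 2.0]])
dst = (2.0 * (R_true @ src.T)).T + np.array([1.0, 2.0])
s, R, t = scale_aware_procrustes(src, dst)
```

With metric depth the recovered scale should stay close to 1 and can simply be fixed; with relative depth it is what restores metric consistency between the lifted BEV points and the aerial reference.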