🤖 AI Summary
This work addresses the limited generalization of cross-view geolocalization in remote sensing, which stems from large viewpoint discrepancies and dataset bias. It proposes VFM-Loc, a zero-shot framework that matches drone and satellite images without any training by using features from vision foundation models. The method first extracts discriminative visual cues hierarchically with generalized mean pooling and scale-weighted RMAC, then progressively aligns the statistical manifolds of drone and satellite features through domain-wise PCA and orthogonal Procrustes analysis. The approach achieves strong zero-shot accuracy on standard benchmarks and surpasses supervised methods by over 20% in Recall@1 on the LO-UCV dataset. The study shows that zero-shot cross-view geolocalization is feasible and can outperform supervised methods under large oblique viewing angles.
📝 Abstract
Cross-View Geo-Localization (CVGL) in remote sensing aims to locate a drone-view query by matching it to geo-tagged satellite images. Although supervised methods have achieved strong results on closed-set benchmarks, they often fail to generalize to unconstrained, real-world scenarios because of severe viewpoint differences and dataset bias. To overcome these limitations, we present VFM-Loc, a training-free framework for zero-shot CVGL that leverages the generalizable visual representations of vision foundation models (VFMs). VFM-Loc identifies and matches discriminative visual clues across viewpoints through a progressive alignment strategy. First, we design a hierarchical clue extraction mechanism that uses Generalized Mean pooling and Scale-Weighted RMAC to preserve distinctive visual clues across scales while maintaining hierarchical confidence. Second, we introduce a statistical manifold alignment pipeline based on domain-wise PCA and Orthogonal Procrustes analysis, which linearly aligns heterogeneous feature distributions in a shared metric space. Experiments demonstrate that VFM-Loc achieves strong zero-shot accuracy on standard benchmarks and surpasses supervised methods by over 20% in Recall@1 on the challenging LO-UCV dataset, which contains large oblique viewing angles. This work shows that principled alignment of pre-trained features can effectively bridge the cross-view gap, establishing a robust, training-free paradigm for real-world CVGL. Code is available at: https://github.com/DingLei14/VFM-Loc.
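To make the two stages named in the abstract more concrete, the following is a minimal NumPy sketch of Generalized Mean pooling followed by domain-wise PCA and an orthogonal Procrustes rotation. It is an illustration under stated assumptions, not the VFM-Loc implementation: the function names (`gem_pool`, `pca_project`, `orthogonal_procrustes`), the exponent `p=3.0`, and the reduced dimension `k` are all hypothetical choices. In particular, the sketch assumes the drone and satellite feature matrices are row-paired when the rotation is estimated; how VFM-Loc obtains correspondences is not stated in the abstract. Scale-Weighted RMAC is omitted.

```python
import numpy as np

def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized mean pooling of an (H, W, C) feature map into a (C,) vector.

    p=1 gives average pooling; larger p moves toward max pooling.
    The default p=3.0 is an assumed value, not taken from the paper.
    """
    x = np.clip(feature_map, eps, None).reshape(-1, feature_map.shape[-1])
    return np.mean(x ** p, axis=0) ** (1.0 / p)

def pca_project(feats, k):
    """Center (N, D) features with their own domain mean, keep top-k axes."""
    centered = feats - feats.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

def orthogonal_procrustes(src, dst):
    """Orthogonal R minimizing ||src @ R - dst||_F for row-paired inputs."""
    u, _, vt = np.linalg.svd(src.T @ dst)
    return u @ vt

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

# Synthetic stand-ins for pooled descriptors (hypothetical data):
# the "satellite" domain is a rotated and shifted copy of the "drone" domain.
rng = np.random.default_rng(0)
drone = rng.normal(size=(50, 8))
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
satellite = drone @ rotation + 5.0

# Domain-wise PCA: each domain is centered and reduced independently.
drone_k = pca_project(drone, k=4)
satellite_k = pca_project(satellite, k=4)

# Rotate drone features into the satellite space, then retrieve by cosine.
R = orthogonal_procrustes(drone_k, satellite_k)
similarity = l2_normalize(drone_k @ R) @ l2_normalize(satellite_k).T
top1 = similarity.argmax(axis=1)  # index of best satellite match per query
```

Because the rotation is orthogonal, it preserves distances and angles within the drone feature set, so the alignment changes only the relative orientation of the two domains rather than the geometry inside either one.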