🤖 AI Summary
Cross-view geolocalization suffers from semantic degradation caused by extreme viewpoint discrepancies, limiting the performance of conventional direct feature-matching approaches. To address this, we propose leveraging multi-scale UAV-captured 3D scenes as an intermediate semantic bridge between street-level and satellite imagery. Our method employs self-supervised and cross-view contrastive learning to enhance feature alignment; integrates a retrieval-augmented module to improve street-view quality; introduces a patch-aware feature aggregation mechanism to strengthen local consistency; and incorporates multi-scale UAV-derived 3D geometric priors to enable robust cross-modal matching. Evaluated on the University-1652 benchmark, our approach achieves a Recall@1 of 25.75%, demonstrating significant improvements in generalization and robustness under severe viewpoint variations. This work establishes a novel, interpretable, and geometry-aware paradigm for cross-view geolocalization.
📝 Abstract
Cross-view geo-localization aims to establish location correspondences between different viewpoints. Existing approaches typically learn cross-view correlations through direct feature similarity matching, often overlooking the semantic degradation caused by extreme viewpoint disparities. To address this problem, we focus on robust feature retrieval under viewpoint variation and propose the novel SkyLink method. We first apply the Google Retrieval Enhancement Module to augment street images, which mitigates occlusion of key targets caused by restricted street viewpoints. A Patch-Aware Feature Aggregation module is further adopted to emphasize multiple local feature aggregations, ensuring consistent feature extraction across viewpoints. Meanwhile, we integrate 3D scene information reconstructed from multi-scale UAV images as a bridge between street and satellite viewpoints, and perform feature alignment through self-supervised and cross-view contrastive learning. Experimental results demonstrate robustness and generalization across diverse urban scenarios, achieving 25.75% Recall@1 on University-1652 in the UAVM2025 Challenge. Code will be released at https://github.com/HRT00/CVGL-3D.
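The cross-view contrastive learning mentioned above typically pulls paired embeddings (e.g. a street image and its matching satellite tile) together while pushing non-matching pairs apart. As a rough illustration only (the paper's actual loss and hyperparameters are not specified here; function names, the temperature value, and the symmetric InfoNCE formulation are assumptions), a minimal NumPy sketch of such a loss could look like:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cross_view_infonce(street_emb, sat_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired cross-view embeddings.

    street_emb, sat_emb: (B, D) arrays; row i of each is a matching pair.
    Hypothetical sketch -- not the released SkyLink implementation.
    """
    s = l2_normalize(street_emb)
    t = l2_normalize(sat_emb)
    logits = s @ t.T / temperature        # (B, B) cosine-similarity matrix
    idx = np.arange(logits.shape[0])      # positives sit on the diagonal

    def cross_entropy(lg):
        # numerically stable log-softmax per row
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # average the street->satellite and satellite->street directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In practice such a loss would be computed on learned encoder outputs with gradient descent (e.g. in PyTorch); the NumPy version above only shows the objective itself. Well-aligned pairs drive the diagonal logits up and the loss toward zero, while mismatched batches score near log(B).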