🤖 AI Summary
Global visual geo-localization aims to infer the real-world geographic coordinates of an image solely from its visual content; the core challenge is effectively aligning visual and geographic representations. To address this, we propose Hierarchical Geographic Embedding (HGE), the first approach to model geographic priors as a multi-granularity hierarchical structure. We further design a semantic-segmentation-guided feature fusion module that jointly encodes appearance features and scene-level semantics, enabling fine-grained visual–geographic alignment. Evaluated on five standard benchmarks, our method outperforms existing state-of-the-art methods and mainstream large vision-language models on 22 of 25 metrics. It significantly improves robustness and accuracy in cross-region and multi-scale localization, demonstrating strong generalization under varying geographic and visual conditions.
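To make the hierarchical-embedding idea concrete, below is a minimal PyTorch sketch of one way multi-granularity geographic embeddings could be scored against an image embedding. The class name `HierarchicalGeoEmbedding`, the number of levels, the cell counts per level, and the cosine-similarity scoring are illustrative assumptions for exposition, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalGeoEmbedding(nn.Module):
    """Illustrative multi-granularity geographic embedding (hypothetical).

    Each level partitions the globe into progressively finer cells
    (cell counts here are arbitrary placeholders); every cell owns a
    learned embedding that the query image's features are matched against.
    """

    def __init__(self, cells_per_level=(64, 1024, 16384), dim=512):
        super().__init__()
        self.levels = nn.ModuleList(
            nn.Embedding(n_cells, dim) for n_cells in cells_per_level
        )

    def forward(self, img_feat):
        # img_feat: (B, dim) visual embedding of the query image.
        img_feat = F.normalize(img_feat, dim=-1)
        logits_per_level = []
        for level in self.levels:
            cells = F.normalize(level.weight, dim=-1)    # (n_cells, dim)
            logits_per_level.append(img_feat @ cells.T)  # cosine scores
        return logits_per_level  # one score map per granularity


# Hypothetical usage: pick the best-matching cell at each granularity,
# yielding a coarse-to-fine localization of the query image.
model = HierarchicalGeoEmbedding()
scores = model(torch.randn(8, 512))
preds = [s.argmax(dim=-1) for s in scores]  # cell indices per level
```

One plausible training setup, consistent with the alignment framing above, would supervise each level with its own classification or contrastive loss so that coarse levels constrain the fine ones; the paper itself does not specify this in the summary.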
📝 Abstract
Worldwide visual geo-localization seeks to determine the geographic location of an image anywhere on Earth using only its visual content. Despite much progress, learning geographic representations for visual geo-localization remains an active research topic. We formulate geo-localization as aligning the visual representation of the query image with a learned geographic representation. Our novel geographic representation explicitly models the world as a hierarchy of geographic embeddings. Additionally, we introduce an approach that efficiently fuses the appearance features of the query image with its semantic segmentation map, forming a robust visual representation. Our main experiments set new state-of-the-art results on 22 of 25 metrics across five benchmark datasets, surpassing prior state-of-the-art (SOTA) methods and recent Large Vision-Language Models (LVLMs). Additional ablation studies support the claim that these gains stem primarily from the combination of geographic and visual representations.
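For the fusion step, here is a minimal sketch of how appearance features might be combined with a semantic segmentation map. Pooling the per-pixel class distribution into a scene-level descriptor and concatenating it with the global appearance embedding is an assumption made for illustration; `SegGuidedFusion`, the class count, and the MLP fusion head are hypothetical, not the module described in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegGuidedFusion(nn.Module):
    """Illustrative fusion of appearance features with a semantic
    segmentation map (class count and architecture are placeholders)."""

    def __init__(self, num_classes=19, feat_dim=512, out_dim=512):
        super().__init__()
        self.seg_proj = nn.Linear(num_classes, feat_dim)
        self.fuse = nn.Sequential(
            nn.Linear(2 * feat_dim, out_dim),
            nn.ReLU(inplace=True),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, appearance, seg_logits):
        # appearance: (B, feat_dim) global visual embedding.
        # seg_logits: (B, num_classes, H, W) per-pixel class scores.
        # Summarize the scene as its spatial mix of semantic classes.
        class_mix = F.softmax(seg_logits, dim=1).mean(dim=(2, 3))  # (B, C)
        seg_feat = self.seg_proj(class_mix)
        return self.fuse(torch.cat([appearance, seg_feat], dim=-1))


# Hypothetical usage with random inputs standing in for a backbone's
# appearance embedding and a segmenter's logit map.
fused = SegGuidedFusion()(torch.randn(4, 512), torch.randn(4, 19, 64, 64))
```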