🤖 AI Summary
Global image geolocalization is difficult because visual content varies widely across regions and because precise GPS coordinates are hard to regress directly. Existing two-stage retrieval methods rely on pointwise supervision and simplistic similarity metrics, and they neglect the spatial relationships among candidate locations. This paper proposes a distance-aware ranking framework, the first to formulate geolocalization as a cross-modal ranking task under multi-order distance constraints. The authors design a multi-order distance loss that jointly optimizes the ranking of absolute and relative geographic distances. They also introduce GeoRanking, the first multimodal dataset explicitly constructed for geographic ranking, and use large vision-language models to encode each query and its candidates jointly, combining pointwise supervision with structured spatial supervision. The approach achieves state-of-the-art performance on IM2GPS3K and YFCC4K, significantly outperforming prior methods.
📝 Abstract
Worldwide image geolocalization, the task of predicting GPS coordinates from images taken anywhere on Earth, poses a fundamental challenge due to the vast diversity in visual content across regions. While recent approaches adopt a two-stage pipeline of retrieving candidates and selecting the best match, they typically rely on simplistic similarity heuristics and pointwise supervision, failing to model spatial relationships among candidates. In this paper, we propose GeoRanker, a distance-aware ranking framework that leverages large vision-language models to jointly encode query-candidate interactions and predict geographic proximity. In addition, we introduce a multi-order distance loss that ranks both absolute and relative distances, enabling the model to reason over structured spatial relationships. To support this, we curate GeoRanking, the first dataset explicitly designed for geographic ranking tasks with multimodal candidate information. GeoRanker achieves state-of-the-art results on two well-established benchmarks (IM2GPS3K and YFCC4K), significantly outperforming current best methods.
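The abstract does not give the form of the multi-order distance loss, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes a pairwise logistic loss in which candidates closer to the query should receive higher scores (first order) and larger distance gaps should correspond to larger score gaps (second order). The function names `first_order_loss` and `second_order_loss`, the use of haversine distance, and the logistic form are all illustrative assumptions.

```python
import math
from itertools import combinations

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def _ordered_pairs(scores, dists):
    """Yield (distance gap, score gap) for each candidate pair, nearer candidate first."""
    for i, j in combinations(range(len(scores)), 2):
        if dists[i] == dists[j]:
            continue  # ties carry no ranking signal
        near, far = (i, j) if dists[i] < dists[j] else (j, i)
        yield dists[far] - dists[near], scores[near] - scores[far]

def first_order_loss(scores, dists):
    """Assumed first-order term: the nearer candidate should score higher."""
    terms = [softplus(-score_gap) for _, score_gap in _ordered_pairs(scores, dists)]
    return sum(terms) / len(terms) if terms else 0.0

def second_order_loss(scores, dists):
    """Assumed second-order term: a larger distance gap should give a larger score gap."""
    terms = []
    for (dg1, sg1), (dg2, sg2) in combinations(list(_ordered_pairs(scores, dists)), 2):
        if dg1 == dg2:
            continue
        big, small = (sg1, sg2) if dg1 > dg2 else (sg2, sg1)
        terms.append(softplus(-(big - small)))
    return sum(terms) / len(terms) if terms else 0.0

# Hypothetical example: a query in Paris with three retrieved candidates.
query = (48.8566, 2.3522)
candidates = [(48.8049, 2.1204), (51.5074, -0.1278), (40.7128, -74.0060)]
dists = [haversine_km(query, c) for c in candidates]
scores = [3.0, 1.0, -2.0]  # made-up model scores, higher means "closer"
total = first_order_loss(scores, dists) + second_order_loss(scores, dists)
```

In this sketch, scores that agree with the true distance ordering yield a lower loss than scores that reverse it. How the two terms are weighted, and how scores are produced by the vision-language model, are left open here because the abstract does not specify them.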