🤖 AI Summary
Global image geolocation loses accuracy when scenes are visually ambiguous or lack distinctive landmarks. A key reason is that existing contrastive learning methods neglect spatial autocorrelation, which makes it difficult to distinguish *false negatives* (visually similar and geographically proximate samples incorrectly labeled as negative) from *hard negatives* (visually similar but geographically distant samples). This work introduces the semivariogram, a fundamental tool in geostatistics, into contrastive learning for the first time, explicitly modeling how feature dissimilarity depends on geographic distance in order to guide hard negative mining and rectify false negative labels. The approach integrates semivariogram-based spatial prior fitting into the GeoCLIP architecture. Evaluated on the OSV5M benchmark, it improves geolocation accuracy, particularly for fine-grained localization (<1 km). The results indicate that incorporating explicit spatial priors enhances vision–geography alignment.
📝 Abstract
Accurate and robust image-based geo-localization at a global scale is challenging due to diverse environments, visually ambiguous scenes, and the lack of distinctive landmarks in many regions. While contrastive learning methods show promising performance by aligning features between street-view images and their corresponding locations, they neglect the underlying spatial dependency in geographic space. As a result, they fail to address false negatives, i.e., image pairs that are both visually and geographically similar but labeled as negatives, and they struggle to distinguish hard negatives, which are visually similar but geographically distant. To address these issues, we propose a novel spatially regularized contrastive learning strategy that integrates a semivariogram, a geostatistical tool for modeling how spatial correlation changes with distance. We fit the semivariogram by relating the distance between images in feature space to their geographic distance, capturing the spatial correlation expected in visual content. With the fitted semivariogram, we define the expected visual dissimilarity at a given spatial distance and use it as a reference to identify hard negatives and false negatives. We integrate this strategy into GeoCLIP and evaluate it on the OSV5M dataset, demonstrating that explicitly modeling spatial priors improves image-based geo-localization performance, particularly at finer granularity.
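
The idea of using a semivariogram as a reference can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: it uses the classical binned estimator, gamma(h) = 1 / (2 N(h)) times the sum of squared feature distances over the N(h) pairs whose geographic distance falls in bin h, instead of a fitted parametric model. The function names, the `near_km` radius, and the `ratio` threshold in `label_negative` are hypothetical choices made only for illustration.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def sq_feature_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def find_bin(geo_km, bin_edges_km):
    """Index of the half-open bin [edge_b, edge_b+1) containing geo_km, or None."""
    for b in range(len(bin_edges_km) - 1):
        if bin_edges_km[b] <= geo_km < bin_edges_km[b + 1]:
            return b
    return None


def empirical_semivariogram(coords, feats, bin_edges_km):
    """Binned estimator: half the mean squared feature distance per distance bin."""
    n_bins = len(bin_edges_km) - 1
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for i, j in combinations(range(len(coords)), 2):
        b = find_bin(haversine_km(coords[i], coords[j]), bin_edges_km)
        if b is not None:
            sums[b] += sq_feature_dist(feats[i], feats[j])
            counts[b] += 1
    # Empty bins have no estimate and are reported as None.
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]


def label_negative(geo_km, pair_semivar, bin_edges_km, gamma,
                   near_km=1.0, ratio=0.5):
    """Hypothetical rule: compare a pair's semivariance (0.5 * squared feature
    distance) with the value expected at its geographic distance."""
    b = find_bin(geo_km, bin_edges_km)
    expected = gamma[b] if b is not None else None
    if expected is None:
        return "unknown"
    if pair_semivar >= ratio * expected:
        return "easy_negative"  # about as dissimilar as geography predicts
    # More similar than expected: nearby pairs are likely mislabeled negatives,
    # distant pairs are the informative hard negatives.
    return "false_negative" if geo_km < near_km else "hard_negative"


# Toy data: two pairs of nearby points about 1,100 km apart on the equator.
coords = [(0.0, 0.0), (0.0, 0.001), (0.0, 10.0), (0.0, 10.001)]
feats = [(1.0, 0.0), (0.0, 0.0), (0.0, 1.0), (0.0, 1.0)]
edges = [0.0, 1.0, 5000.0]

gamma = empirical_semivariogram(coords, feats, edges)
print(gamma)                                      # [0.25, 0.75]
print(label_negative(2000.0, 0.1, edges, gamma))  # hard_negative
print(label_negative(0.5, 0.05, edges, gamma))    # false_negative
```

In a contrastive training loop, such labels could be used to up-weight hard negatives and to exclude or down-weight false negatives in the loss. How the paper actually applies them is not specified in the abstract.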