🤖 AI Summary
This work addresses two challenges of global-scale visual geolocalization, visual ambiguity and the inherently hierarchical structure of geography, by proposing an entity-centric hierarchical approach. It is the first to embed geographic entities (countries, regions, subregions, and cities) in hyperbolic space, and it introduces Geo-Weighted Hyperbolic contrastive learning, which incorporates haversine distance into the contrastive objective to align images directly with those entities. The method sets a new state of the art on the OSV5M benchmark: it reduces mean geodesic error by 19.5% and improves subregion accuracy by 43%, while replacing over five million image embeddings with only 240,000 entity embeddings. The hierarchical design also makes predictions interpretable and substantially lowers storage overhead.
📝 Abstract
Visual geolocalization, the task of predicting where an image was taken, remains challenging due to global scale, visual ambiguity, and the inherently hierarchical structure of geography. Existing paradigms rely on large-scale retrieval, which requires storing a large number of image embeddings; on grid-based classifiers, which ignore geographic continuity; or on generative models, which diffuse over space but struggle with fine detail. We introduce an entity-centric formulation of geolocation that replaces image-to-image retrieval with a compact hierarchy of geographic entities embedded in hyperbolic space. Images are aligned directly to country, region, subregion, and city entities through Geo-Weighted Hyperbolic contrastive learning, which incorporates haversine distance directly into the contrastive objective. This hierarchical design enables interpretable predictions and efficient inference with 240k entity embeddings instead of over 5 million image embeddings on the OSV5M benchmark, where our method establishes a new state of the art. Compared with existing methods in the literature, it reduces mean geodesic error by 19.5% while improving fine-grained subregion accuracy by 43%. These results demonstrate that geometry-aware hierarchical embeddings provide a scalable and conceptually new alternative for global image geolocation.
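To make the idea of folding haversine distance into a hyperbolic contrastive objective more concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not specify the weighting scheme, the hyperbolic model, or any hyperparameters, so everything here is an assumption: the Poincaré ball model with curvature -1, soft targets of the form exp(-haversine / scale_km), and the hypothetical names `geo_weighted_loss`, `temperature`, and `scale_km`.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance in the Poincare ball (curvature -1); assumes |u|, |v| < 1."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    denom = max((1.0 - sq_u) * (1.0 - sq_v), eps)
    return math.acosh(1.0 + 2.0 * sq_diff / denom)


def geo_weighted_loss(image_emb, image_latlon, entity_embs, entity_latlons,
                      temperature=0.1, scale_km=1000.0):
    """Cross-entropy between a softmax over negative hyperbolic distances and
    soft targets that decay with haversine distance (assumed weighting)."""
    # Similarity logits: closer in hyperbolic space means a larger logit.
    logits = [-poincare_distance(image_emb, e) / temperature for e in entity_embs]

    # Assumed geo-weighting: entities near the true location get more target mass.
    weights = [math.exp(-haversine_km(*image_latlon, *ll) / scale_km)
               for ll in entity_latlons]
    total = sum(weights)
    targets = [w / total for w in weights]

    # Numerically stable log-softmax over the logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -sum(t * (l - log_z) for t, l in zip(targets, logits))
```

Under this assumed weighting, an image embedded close to a geographically nearby entity incurs a lower loss than one embedded close to a distant entity, and entities that are near but not identical to the true location are penalized less than far-away ones. At inference time, a prediction under the same assumptions would be the entity with the smallest `poincare_distance` to the image embedding.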