🤖 AI Summary
Visual geolocalization aims to infer the geographic coordinates of a query image captured anywhere on Earth. Precision and generalization remain difficult because the search space is planet-scale and visually similar scenes occur in distant locations. This paper proposes a lightweight method based on supervised fine-tuning (SFT) of the Gemma 3 multimodal foundation model, which achieves highly competitive global localization performance using only 2,700 high-quality image-GPS pairs. Key contributions include: (i) an empirical demonstration that small-scale, high-fidelity data coupled with lightweight SFT can compete with approaches that rely on large-scale databases or complex multi-stage pipelines; (ii) the construction and public release of MR40k, a new benchmark targeting sparsely populated regions; and (iii) an exploration of geo-aware sampling, multi-candidate inference, and aggregation strategies. The method substantially improves over baseline models on Im2GPS-3k, YFCC-4k, and MR40k, and the core gains are already realized at the SFT stage.
📝 Abstract
Visual geolocation, the task of determining where a single image was taken, remains a formidable challenge because of the planet's vastness and the deceptive similarity among distant locations. We introduce GeoLocSFT, a framework that demonstrates how targeted supervised fine-tuning (SFT) of a large multimodal foundation model (Gemma 3) on a small, high-quality dataset can yield highly competitive geolocation performance. GeoLocSFT is trained with only 2,700 carefully selected image-GPS pairs from our geographically diverse MR600k dataset. Despite this limited data, our SFT-centric approach substantially improves over baseline models and achieves robust results on standard benchmarks such as Im2GPS-3k and YFCC-4k, as well as on our newly proposed and challenging MR40k benchmark, which targets sparsely populated regions. We also explore multi-candidate inference and aggregation strategies, but find that the core gains are already realized at the SFT stage. Our findings highlight the power of high-quality supervision and efficient SFT for planet-scale image geolocation, especially when compared to prior methods that require massive databases or complex pipelines. To foster further research, we publicly release the MR40k benchmark dataset.
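The abstract mentions multi-candidate inference and aggregation without specifying the aggregation rule. As a purely illustrative sketch, and not the authors' method, one simple option is to sample several (latitude, longitude) predictions for an image and keep the medoid, i.e. the candidate with the smallest total great-circle distance to the others. The function names `haversine_km` and `aggregate_medoid` and the example coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def aggregate_medoid(candidates):
    """Assumed aggregation rule: return the candidate closest to all others.

    `candidates` is a non-empty list of (lat, lon) tuples, e.g. several
    sampled model outputs for the same query image.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return min(
        candidates,
        key=lambda c: sum(haversine_km(c, other) for other in candidates),
    )


# Hypothetical candidates: two near Paris, one outlier near New York.
candidates = [(48.85, 2.35), (48.86, 2.34), (40.71, -74.01)]
best = aggregate_medoid(candidates)
```

A medoid is robust to a single far-off guess, which is why it is used here instead of averaging coordinates; averaging latitude and longitude directly can also produce points far from every candidate. The same `haversine_km` helper is the usual basis for reporting geolocation accuracy at fixed distance thresholds.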