GeoLocSFT: Efficient Visual Geolocation via Supervised Fine-Tuning of Multimodal Foundation Models

📅 2025-06-02
🤖 AI Summary
Visual geolocation aims to infer the geographic coordinates of a query image captured anywhere on Earth. Precision and generalization remain difficult because the search space is planet-scale and visually similar scenes occur in distant locations. This paper proposes a lightweight method based on supervised fine-tuning (SFT) of the Gemma 3 multimodal foundation model, reaching highly competitive global localization performance with only 2,700 high-quality image-GPS pairs. Key contributions include: (i) an empirical demonstration that small-scale, high-quality data combined with lightweight SFT compares favorably with approaches that rely on large-scale databases or complex multi-stage pipelines; (ii) the construction and public release of MR40k, a new benchmark targeting sparsely populated regions; and (iii) integration of geo-aware sampling, multi-candidate inference, and aggregation strategies. The method substantially improves over baseline models on Im2GPS-3k, YFCC-4k, and MR40k, and the paper reports that the core gains are already realized at the SFT stage.

📝 Abstract
Accurately determining the geographic location where a single image was taken (visual geolocation) remains a formidable challenge due to the planet's vastness and the deceptive similarity among distant locations. We introduce GeoLocSFT, a framework that demonstrates how targeted supervised fine-tuning (SFT) of a large multimodal foundation model (Gemma 3) using a small, high-quality dataset can yield highly competitive geolocation performance. GeoLocSFT is trained with only 2,700 carefully selected image-GPS pairs from our geographically diverse MR600k dataset. Despite this limited data, our SFT-centric approach substantially improves over baseline models and achieves robust results on standard benchmarks such as Im2GPS-3k and YFCC-4k, as well as on our newly proposed and challenging MR40k benchmark, which specifically targets sparsely populated regions. Further, we explore multi-candidate inference and aggregation strategies but find that the core gains are already realized at the SFT stage. Our findings highlight the power of high-quality supervision and efficient SFT for planet-scale image geolocation, especially when compared to prior methods that require massive databases or complex pipelines. To foster further research, we publicly release the MR40k benchmark dataset.

Problem

Research questions and friction points this paper is trying to address.

Accurately geolocating a single image across a planet-scale search space
Improving geolocation through supervised fine-tuning on minimal data
Improving performance in sparsely populated regions, evaluated with a new benchmark (MR40k)
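
To make "accurate geolocation" concrete, the following is a minimal sketch of how predictions on benchmarks of this kind are often scored: the great-circle error between predicted and true coordinates, summarized as the fraction of images within a set of distance thresholds. The function names and the threshold values (1, 25, 200, 750, 2500 km) are assumptions for illustration; this page does not state which metric or thresholds the paper uses.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def accuracy_at_thresholds(preds, truths, thresholds_km=(1, 25, 200, 750, 2500)):
    """Fraction of predictions whose error is within each threshold.

    preds and truths are equal-length sequences of (lat, lon) pairs in degrees.
    The default thresholds are an assumption, not taken from the paper.
    """
    if not preds or len(preds) != len(truths):
        raise ValueError("preds and truths must be non-empty and of equal length")
    errors = [haversine_km(p[0], p[1], t[0], t[1]) for p, t in zip(preds, truths)]
    return {thr: sum(e <= thr for e in errors) / len(errors) for thr in thresholds_km}
```

A call such as `accuracy_at_thresholds(predicted_coords, true_coords)` returns a dictionary keyed by threshold, which makes it straightforward to compare a fine-tuned model against a baseline at each distance scale.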

Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervised fine-tuning of multimodal foundation models
Small high-quality dataset for efficient training
Multi-candidate inference and aggregation strategies
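
This page does not describe how multiple candidate predictions are aggregated, so the following is only one plausible sketch, not the paper's method. It assumes the model is sampled several times for the same image, yielding a list of (latitude, longitude) candidates, and it returns the candidate that agrees most with the others: the one minimizing the summed squared chord distance to all candidates on the unit sphere. The function names are hypothetical.

```python
import math


def to_unit_vector(lat_deg, lon_deg):
    """Convert (lat, lon) in degrees to a 3D unit vector on the sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


def aggregate_candidates(candidates):
    """Pick the candidate closest, in aggregate, to all other candidates.

    For unit vectors, the summed squared chord distance from candidate i to
    all candidates equals 2n - 2 * dot(v_i, sum of all vectors), so the
    candidate with the largest dot product against the vector sum is chosen.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    vectors = [to_unit_vector(lat, lon) for lat, lon in candidates]
    total = [sum(v[i] for v in vectors) for i in range(3)]
    scores = [sum(v[i] * total[i] for i in range(3)) for v in vectors]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]
```

Returning an actual candidate rather than an average avoids producing a point in the middle of an ocean when the candidates split between two distant regions; a single outlier is simply outvoted by the cluster.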