🤖 AI Summary
This work addresses core challenges in street-view image geolocation: scarce, low-quality training data for Large Vision-Language Models (LVLMs), weak visual cues, and the absence of reasoning annotations. It introduces a new LVLM paradigm infused with human reasoning knowledge. Methodologically: (1) we propose the first CLIP-based localizability quantification network to automatically curate high-quality street-view datasets; (2) we incorporate multi-step human reasoning trajectories from real-world geolocation games as external knowledge; and (3) we design a two-stage fine-tuning strategy that first strengthens spatial reasoning and then refines coordinate-level localization. On country- and city-level geolocation benchmarks, our method outperforms existing LVLM baselines by more than 25% and 38%, respectively, and surpasses StreetCLIP at significantly lower training cost. The code and dataset are publicly released.
📝 Abstract
This work tackles the problem of geo-localization with a new paradigm: a large vision-language model (LVLM) augmented with human inference knowledge. A primary challenge is the scarcity of suitable training data: existing street-view datasets often contain numerous low-quality images lacking visual clues, and none provide reasoning annotations. To address the data-quality issue, we devise a CLIP-based network that quantifies how locatable a street-view image is, leading to a new dataset comprising highly locatable street views. To enhance reasoning, we integrate external knowledge obtained from real geo-localization games, tapping into valuable human inference capabilities. With these data we train GeoReasoner, which is fine-tuned through dedicated reasoning and location-tuning stages. Qualitative and quantitative evaluations show that GeoReasoner outperforms counterpart LVLMs by more than 25% on country-level and 38% on city-level geo-localization tasks, and surpasses StreetCLIP while requiring fewer training resources. The data and code are available at https://github.com/lingli1996/GeoReasoner.
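The abstract does not spell out how the CLIP-based localizability network scores images. A minimal sketch of one plausible formulation, scoring an image by the softmax of its CLIP similarities to "locatable" versus "unlocatable" text prompts, is shown below. All names here are illustrative, and the placeholder vectors stand in for real CLIP image/text encoder outputs:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def localizability_score(image_emb: np.ndarray,
                         locatable_emb: np.ndarray,
                         unlocatable_emb: np.ndarray,
                         temperature: float = 0.07) -> float:
    """Hypothetical scoring: softmax over CLIP similarities to a
    'locatable' prompt vs. an 'unlocatable' prompt; returns the
    probability mass assigned to 'locatable'."""
    sims = np.array([cosine(image_emb, locatable_emb),
                     cosine(image_emb, unlocatable_emb)]) / temperature
    probs = np.exp(sims - sims.max())
    probs /= probs.sum()
    return float(probs[0])

# Toy embeddings; in practice these would come from a pretrained CLIP model.
rng = np.random.default_rng(0)
locatable = rng.normal(size=512)      # text embedding of a "locatable" prompt
unlocatable = rng.normal(size=512)    # text embedding of an "unlocatable" prompt
street_view = 0.8 * locatable + 0.2 * rng.normal(size=512)  # image resembling "locatable"

score = localizability_score(street_view, locatable, unlocatable)
print(round(score, 3))  # high score -> keep the image in the curated dataset
```

Images scoring below a chosen threshold (e.g., featureless walls or close-ups with no geographic cues) would be filtered out, which matches the paper's goal of curating highly locatable street views.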