GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model

📅 2024-06-03
🏛️ International Conference on Machine Learning
📈 Citations: 19
Influential: 4
🤖 AI Summary
This work addresses core challenges in street-view image geo-localization—scarce and low-quality training data for Large Vision-Language Models (LVLMs), weak visual cues, and the absence of reasoning annotations—by introducing a novel LVLM paradigm infused with human reasoning knowledge. Methodologically: (1) we propose the first CLIP-based localizability quantification network to automatically curate high-quality street-view datasets; (2) we incorporate multi-step human reasoning trajectories from real-world geo-localization games as external knowledge; and (3) we design a two-stage fine-tuning strategy—first enhancing spatial reasoning, then refining coordinate-level localization. On country- and city-level geo-localization benchmarks, our method outperforms existing LVLM baselines by more than 25% and 38%, respectively, and surpasses StreetCLIP while requiring significantly lower training cost. The code and dataset are publicly released.

📝 Abstract
This work tackles the problem of geo-localization with a new paradigm using a large vision-language model (LVLM) augmented with human inference knowledge. A primary challenge here is the scarcity of data for training the LVLM: existing street-view datasets often contain numerous low-quality images lacking visual clues, and lack any reasoning inference. To address the data-quality issue, we devise a CLIP-based network to quantify the degree of street-view images being locatable, leading to the creation of a new dataset comprising highly locatable street views. To enhance reasoning inference, we integrate external knowledge obtained from real geo-localization games, tapping into valuable human inference capabilities. The data are utilized to train GeoReasoner, which undergoes fine-tuning through dedicated reasoning and location-tuning stages. Qualitative and quantitative evaluations illustrate that GeoReasoner outperforms counterpart LVLMs by more than 25% at country-level and 38% at city-level geo-localization tasks, and surpasses StreetCLIP performance while requiring fewer training resources. The data and code are available at https://github.com/lingli1996/GeoReasoner.
Problem

Research questions and friction points this paper is trying to address.

Addressing geo-localization challenges using vision-language models
Overcoming data scarcity and poor image quality in street views
Enhancing reasoning capabilities with human inference knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

CLIP-based network filters locatable street-view images
Integrates human inference knowledge from geo-localization games
Fine-tunes LVLM through reasoning and location-tuning stages
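The CLIP-based filtering idea above can be illustrated as a zero-shot scoring rule: embed each street-view image, compare it against text prompts describing "locatable" vs. "unlocatable" scenes, and keep images whose softmax score exceeds a threshold. A minimal sketch with stand-in embedding vectors — the function names, prompts, temperature, and threshold are illustrative assumptions, not the paper's actual network (real use would plug in CLIP image/text encoders):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def localizability_score(img_emb, locatable_emb, unlocatable_emb, temperature=0.07):
    """Zero-shot score in [0, 1]: softmax over similarities to the two text prompts.

    Higher means the image looks more like the 'locatable' prompt
    (e.g., visible signage, architecture, road markings).
    """
    sims = np.array([cosine(img_emb, locatable_emb),
                     cosine(img_emb, unlocatable_emb)]) / temperature
    exp = np.exp(sims - sims.max())  # numerically stable softmax
    return float(exp[0] / exp.sum())

def filter_dataset(image_embs, locatable_emb, unlocatable_emb, threshold=0.5):
    """Keep only images scored as likely locatable (hypothetical curation step)."""
    return [e for e in image_embs
            if localizability_score(e, locatable_emb, unlocatable_emb) > threshold]
```

In practice the two prompt embeddings would come from CLIP's text encoder (e.g., "a street view with distinctive visual clues" vs. "a featureless road"), and the threshold would be tuned on held-out annotations.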
Ling Li
The Hong Kong University of Science and Technology (Guangzhou)
Yu Ye
Tongji University
urban morphology, urban design, urban science, architecture
Bingchuan Jiang
Strategic Support Force Information Engineering University
Wei Zeng
The Hong Kong University of Science and Technology