GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model

📅 2024-06-03
🏛️ International Conference on Machine Learning
📈 Citations: 19
Influential: 4
🤖 AI Summary
This work addresses core challenges in street-view image geo-localization—scarce and low-quality training data for Large Vision-Language Models (LVLMs), weak visual cues, and the absence of reasoning annotations—by introducing a novel LVLM paradigm infused with human reasoning knowledge. Methodologically: (1) we propose the first CLIP-based localizability quantification network to automatically curate high-quality street-view datasets; (2) we incorporate multi-step human reasoning trajectories from real-world geo-localization games as external knowledge; and (3) we design a two-stage fine-tuning strategy—first enhancing spatial reasoning, then refining coordinate-level localization. On country- and city-level geo-localization benchmarks, our method outperforms existing LVLM baselines by more than 25% and 38%, respectively, and surpasses StreetCLIP while requiring significantly lower training cost. The code and dataset are publicly released.

📝 Abstract
This work tackles the problem of geo-localization with a new paradigm using a large vision-language model (LVLM) augmented with human inference knowledge. A primary challenge here is the scarcity of data for training the LVLM: existing street-view datasets often contain numerous low-quality images lacking visual clues, and lack any reasoning inference. To address the data-quality issue, we devise a CLIP-based network to quantify the degree of street-view images being locatable, leading to the creation of a new dataset comprising highly locatable street views. To enhance reasoning inference, we integrate external knowledge obtained from real geo-localization games, tapping into valuable human inference capabilities. The data are utilized to train GeoReasoner, which undergoes fine-tuning through dedicated reasoning and location-tuning stages. Qualitative and quantitative evaluations illustrate that GeoReasoner outperforms counterpart LVLMs by more than 25% at country-level and 38% at city-level geo-localization tasks, and surpasses StreetCLIP performance while requiring fewer training resources. The data and code are available at https://github.com/lingli1996/GeoReasoner.
Problem

Research questions and friction points this paper is trying to address.

Addressing geo-localization challenges using vision-language models
Overcoming data scarcity and poor image quality in street views
Enhancing reasoning capabilities with human inference knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

CLIP-based network filters locatable street-view images
Integrates human inference knowledge from geo-localization games
Fine-tunes LVLM through reasoning and location-tuning stages
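The CLIP-based filtering idea above can be illustrated as a zero-shot scoring rule: embed each street-view image, compare it against text prompts describing "locatable" vs. "unlocatable" scenes, and keep images whose softmax score exceeds a threshold. A minimal sketch with stand-in embedding vectors — the function names, prompts, temperature, and threshold are illustrative assumptions, not the paper's actual network (real use would plug in CLIP image/text encoders):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def localizability_score(img_emb, locatable_emb, unlocatable_emb, temperature=0.07):
    """Zero-shot score in [0, 1]: softmax over similarities to the two text prompts.

    Higher means the image looks more like the 'locatable' prompt
    (e.g., visible signage, architecture, road markings).
    """
    sims = np.array([cosine(img_emb, locatable_emb),
                     cosine(img_emb, unlocatable_emb)]) / temperature
    exp = np.exp(sims - sims.max())  # numerically stable softmax
    return float(exp[0] / exp.sum())

def filter_dataset(image_embs, locatable_emb, unlocatable_emb, threshold=0.5):
    """Keep only images scored as likely locatable (hypothetical curation step)."""
    return [e for e in image_embs
            if localizability_score(e, locatable_emb, unlocatable_emb) > threshold]
```

In practice the two prompt embeddings would come from CLIP's text encoder (e.g., "a street view with distinctive visual clues" vs. "a featureless road"), and the threshold would be tuned on held-out annotations.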
Ling Li
The Hong Kong University of Science and Technology (Guangzhou)
Yu Ye
Tongji University
urban morphology, urban design, urban science, architecture
Bingchuan Jiang
Strategic Support Force Information Engineering University
Wei Zeng
The Hong Kong University of Science and Technology