🤖 AI Summary
Existing vision-language models for image geo-localization are often constrained by fixed reasoning depths or low-quality retrieval corpora, leading to hallucinations and limited localization accuracy. To address these limitations, this work proposes Geo-ADAPT, an adaptive reasoning framework built around a novel locatability scoring mechanism that dynamically adjusts reasoning depth per image. The authors construct Geo-ADAPT-51K, a locatability-stratified reasoning dataset, and combine retrieval-augmented generation with a two-stage Group Relative Policy Optimization (GRPO) curriculum guided by tailored rewards. The resulting model achieves state-of-the-art performance across multiple geo-localization benchmarks while substantially reducing hallucinations and improving both localization accuracy and reasoning efficiency.
📝 Abstract
The emergence of Vision-Language Models (VLMs) has introduced new paradigms for global image geo-localization through retrieval-augmented generation (RAG) and reasoning-driven inference. However, RAG methods are constrained by retrieval database quality, while reasoning-driven approaches fail to internalize image locatability, relying on inefficient, fixed-depth reasoning paths that increase hallucinations and degrade accuracy. To overcome these limitations, we introduce an Optimized Locatability Score that quantifies an image's suitability for deep reasoning in geo-localization. Using this metric, we curate Geo-ADAPT-51K, a locatability-stratified reasoning dataset enriched with augmented reasoning trajectories for complex visual scenes. Building on this foundation, we propose a two-stage Group Relative Policy Optimization (GRPO) curriculum with customized reward functions that regulate adaptive reasoning depth, visual grounding, and hierarchical geographical accuracy. Our framework, Geo-ADAPT, learns an adaptive reasoning policy, achieves state-of-the-art performance across multiple geo-localization benchmarks, and substantially reduces hallucinations by reasoning both adaptively and efficiently.
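GRPO, mentioned in the abstract, replaces a learned value critic with per-group reward normalization: several responses are sampled for the same prompt, and each response's advantage is its reward standardized against the group's mean and standard deviation. A minimal sketch of that advantage computation is below; the composite reward combining the abstract's three criteria (reasoning depth, visual grounding, hierarchical geographic accuracy) is an illustrative assumption, and its component names and weights are hypothetical, not the paper's actual reward design.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize each sampled response's reward
    against the mean and std of its own group (responses to one prompt)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All responses scored identically: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

def composite_reward(depth_score, grounding_score, geo_accuracy,
                     w_depth=0.2, w_ground=0.3, w_geo=0.5):
    """Hypothetical weighted mix of the three reward criteria the abstract
    names; the weights here are placeholders, not from the paper."""
    return w_depth * depth_score + w_ground * grounding_score + w_geo * geo_accuracy

# Four sampled responses to one image prompt: responses scoring above the
# group mean get positive advantages, those below get negative ones.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses with above-average composite reward within their group are reinforced, which is how the curriculum can push the policy toward deeper reasoning only when it actually improves geographic accuracy.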