🤖 AI Summary
This work addresses two weaknesses of existing image geolocation methods: factual hallucination and limited generalization in open-world scenarios. To mitigate them, we propose LocationAgent, a hierarchical agent built on a Reasoner-Executor-Recorder (RER) architecture that keeps the reasoning logic inside the model while delegating the verification of geographic evidence to external tools, forming a hypothesis-verification loop. Role separation and context compression limit drift across multi-step reasoning, and a suite of clue exploration tools supplies diverse evidence for each candidate location. We further introduce CCL-Bench (China City Location Bench), an image geolocation benchmark built to address data leakage and the scarcity of Chinese data, covering various scene granularities and difficulty levels. In zero-shot settings, LocationAgent outperforms existing methods by at least 30%.
📝 Abstract
Image geolocation aims to infer capture locations based on visual content. Fundamentally, this constitutes a reasoning process composed of hypothesis-verification cycles, requiring models to possess both geospatial reasoning capabilities and the ability to verify evidence against geographic facts. Existing methods typically internalize location knowledge and reasoning patterns into static memory via supervised training or trajectory-based reinforcement fine-tuning. Consequently, these methods are prone to factual hallucinations and generalization bottlenecks in open-world settings or scenarios requiring dynamic knowledge. To address these challenges, we propose a Hierarchical Localization Agent, called LocationAgent. Our core philosophy is to retain hierarchical reasoning logic within the model while offloading the verification of geographic evidence to external tools. To implement hierarchical reasoning, we design the RER architecture (Reasoner-Executor-Recorder), which employs role separation and context compression to prevent the drifting problem in multi-step reasoning. For evidence verification, we construct a suite of clue exploration tools that provide diverse evidence to support location reasoning. Furthermore, to address data leakage and the scarcity of Chinese data in existing datasets, we introduce CCL-Bench (China City Location Bench), an image geolocation benchmark encompassing various scene granularities and difficulty levels. Extensive experiments demonstrate that LocationAgent significantly outperforms existing methods by at least 30% in zero-shot settings.
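To make the hypothesis-verification cycle concrete, the following is a minimal sketch of how a Reasoner-Executor-Recorder loop might be wired together. It is not the authors' implementation: the names `reasoner`, `executor`, `Recorder`, `locate`, the two toy tools, and the tiny `GAZETTEER` lookup table are all hypothetical stand-ins, and the Reasoner here is a simple rule where the paper would use a model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical toy lookup table standing in for external geographic knowledge.
GAZETTEER: Dict[str, Dict[str, str]] = {
    "Chengdu": {"landmark": "Anshun Bridge", "plate_prefix": "川A"},
    "Hangzhou": {"landmark": "West Lake", "plate_prefix": "浙A"},
}

# A tool takes a hypothesis (city name) and image clues, returns True if supported.
Tool = Callable[[str, Dict[str, str]], bool]


def landmark_tool(city: str, clues: Dict[str, str]) -> bool:
    """Toy check: does the image caption mention the city's landmark?"""
    return GAZETTEER[city]["landmark"] in clues.get("caption", "")


def plate_tool(city: str, clues: Dict[str, str]) -> bool:
    """Toy check: does a visible licence plate carry the city's prefix?"""
    return clues.get("plate", "").startswith(GAZETTEER[city]["plate_prefix"])


TOOLS: Dict[str, Tool] = {"landmark": landmark_tool, "plate": plate_tool}


@dataclass
class Recorder:
    """Stores one short note per tool call (a crude form of context compression)."""
    notes: List[str] = field(default_factory=list)
    rejected: Set[str] = field(default_factory=set)

    def record(self, city: str, tool: str, supported: bool) -> None:
        verdict = "supported" if supported else "rejected"
        self.notes.append(f"{city}: {verdict} by {tool}")
        if not supported:
            self.rejected.add(city)


def reasoner(candidates: List[str], recorder: Recorder) -> Optional[str]:
    """Propose the next hypothesis not yet rejected (stand-in for a model call)."""
    for city in candidates:
        if city not in recorder.rejected:
            return city
    return None


def executor(city: str, clues: Dict[str, str], tools: Dict[str, Tool]) -> Dict[str, bool]:
    """Run every tool against the current hypothesis."""
    return {name: tool(city, clues) for name, tool in tools.items()}


def locate(candidates: List[str], clues: Dict[str, str], tools: Dict[str, Tool],
           max_steps: int = 5) -> Tuple[Optional[str], List[str]]:
    """Loop: hypothesize, verify with tools, record, stop when all tools agree."""
    recorder = Recorder()
    for _ in range(max_steps):
        city = reasoner(candidates, recorder)
        if city is None:
            break
        results = executor(city, clues, tools)
        for tool_name, supported in results.items():
            recorder.record(city, tool_name, supported)
        if all(results.values()):
            return city, recorder.notes
    return None, recorder.notes


if __name__ == "__main__":
    clues = {"caption": "night view of Anshun Bridge", "plate": "川A12345"}
    print(locate(["Hangzhou", "Chengdu"], clues, TOOLS))
```

The point of the separation in this sketch is that the Reasoner only ever reads the Recorder's short notes rather than raw tool output, which is one plausible reading of how role separation and context compression could limit drift across steps.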