🤖 AI Summary
This work addresses a limitation of existing image geolocation methods: they provide little fine-grained supervision over the evidence search and verification process, which limits their ability to emulate the multi-step reasoning of human experts. The authors propose REVERSE, a reinforcement-learning-based framework for multi-turn agentic reasoning that trains three intermediate decisions (region selection, query generation, and evidence discrimination) within a closed loop of search and verification. The approach introduces process-level rewards for visual grounding, query utility, and evidence discrimination, and builds tool-grounded trajectory data annotated with region selections, search observations, and geo-informative evidence labels. An offline search cache keeps retrieval observations stable and reusable during reinforcement learning, which enables dense supervision over noisy search results. With a 4B-parameter model, REVERSE outperforms strong retrieval-augmented baselines on the Im2GPS3k and YFCC4k benchmarks and rivals substantially larger models.
📝 Abstract
Image geo-localization aims to determine where a photograph was taken, a task that often requires more than recognizing visible landmarks. Human experts typically solve it through an iterative workflow: they inspect informative regions, form location hypotheses, seek external evidence, and revise their judgments as new clues appear. Existing methods only partially capture this process: direct prediction methods bypass evidence acquisition altogether, while retrieval-augmented methods introduce external evidence but usually provide limited supervision on the intermediate decisions of where to search, how to query, and how to filter noisy results. We present REVERSE, a framework that reinforces the interplay between evidence search and verification to enable multi-turn agentic reasoning. REVERSE teaches three intermediate decisions: where to look, what to query, and what evidence to trust. To support this, we construct tool-grounded trajectories with annotated region selections, search observations, and geo-informative evidence labels, and introduce process rewards for visual grounding, query utility, and evidence discrimination. An offline search cache makes retrieval observations stable and reusable during reinforcement learning, enabling dense supervision over noisy search results. With a 4B model, REVERSE outperforms strong retrieval-augmented baselines and rivals substantially larger models on Im2GPS3k and YFCC4k. Code is available at https://github.com/yonglleee/REVERSE.
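To make the role of the offline search cache concrete, the following is a minimal sketch of how cached retrieval could keep observations stable across reinforcement learning rollouts. It is not taken from the REVERSE code: the class `OfflineSearchCache`, the helper `normalize_query`, the JSON file layout, and the `search_fn` backend are all hypothetical names chosen for illustration.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional


def normalize_query(query: str) -> str:
    """Lowercase and collapse whitespace so trivially different queries share a key."""
    return " ".join(query.lower().split())


class OfflineSearchCache:
    """Hypothetical on-disk cache: the same query always yields the same observation."""

    def __init__(self, path: str,
                 search_fn: Optional[Callable[[str], List[str]]] = None):
        self.path = Path(path)
        self.search_fn = search_fn  # assumed live backend, used only on a cache miss
        self.store: Dict[str, List[str]] = {}
        if self.path.exists():
            self.store = json.loads(self.path.read_text(encoding="utf-8"))

    @staticmethod
    def key(query: str) -> str:
        # Hash the normalized query to get a fixed-length, file-safe key.
        return hashlib.sha256(normalize_query(query).encode("utf-8")).hexdigest()

    def lookup(self, query: str) -> List[str]:
        k = self.key(query)
        if k not in self.store:
            if self.search_fn is None:
                return []  # offline mode: unseen queries return no evidence
            self.store[k] = self.search_fn(query)
        return self.store[k]

    def save(self) -> None:
        # Persist the cache so later training runs see identical observations.
        self.path.write_text(json.dumps(self.store), encoding="utf-8")
```

In this sketch, a cache populated once with a live backend can be reloaded without `search_fn`, so every rollout that issues the same query receives the same snippets and any reward computed from them is reproducible.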