GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

📅 2025-11-19

🤖 AI Summary

This work addresses the limited reasoning capability of existing vision agents on complex geolocalization tasks. Because current benchmarks lack high-resolution imagery and sufficiently hard localization cases for deep agentic reasoning, we introduce GeoBench, a benchmark of photos and panoramas from around the world plus a subset of satellite images of different cities. We also propose GeoVista, an agentic model that invokes an image-zoom-in tool and a web-search tool inside its reasoning loop. The model is trained with a cold-start supervised fine-tuning (SFT) stage followed by a reinforcement learning (RL) stage that uses a hierarchical reward over multi-level geographical information. On GeoBench, GeoVista substantially outperforms other open-source agentic models and achieves performance comparable to the closed-source Gemini-2.5-Flash and GPT-5 on most metrics.

📝 Abstract

Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools, leaving a gap toward more general-purpose agentic models. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses during reasoning. Since existing geolocalization benchmarks lack both high-resolution imagery and localization tasks hard enough to require deep agentic reasoning, we curate GeoBench, a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities, to rigorously evaluate the geolocalization ability of agentic models. We also propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. We develop a complete training pipeline for it, including a cold-start supervised fine-tuning (SFT) stage to learn reasoning patterns and tool-use priors, followed by a reinforcement learning (RL) stage to further enhance reasoning ability. We adopt a hierarchical reward to leverage multi-level geographical information and improve overall geolocalization performance. Experimental results show that GeoVista substantially outperforms other open-source agentic models on the geolocalization task and achieves performance comparable to closed-source models such as Gemini-2.5-Flash and GPT-5 on most metrics.

Problem

Research questions and friction points this paper is trying to address.

Developing agentic models for geolocalization using visual reasoning and web search

Creating a benchmark with global photos and satellite imagery for evaluation

Integrating tool invocation within reasoning loops to enhance geolocalization accuracy
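
To make "tool invocation within the reasoning loop" concrete, here is a minimal sketch of such a loop. It is an illustration under stated assumptions, not GeoVista's implementation: the names `run_agent`, `scripted_policy`, `zoom_in`, and `web_search`, the action format, and the turn limit are all hypothetical, and both tools are stubs that return placeholder strings rather than cropping images or querying the web.

```python
from typing import Callable, Dict, List, Tuple

# An action is (name, argument). The name "answer" ends the episode.
# This action format is an assumption made for illustration only.
Action = Tuple[str, str]
Policy = Callable[[List[str]], Action]


def zoom_in(box: str) -> str:
    """Stub tool: a real version would crop and upscale the image region."""
    return f"[crop of region {box}]"


def web_search(query: str) -> str:
    """Stub tool: a real version would return retrieved web snippets."""
    return f"[stub results for '{query}']"


def run_agent(policy: Policy,
              tools: Dict[str, Callable[[str], str]],
              max_turns: int = 5) -> Tuple[str, List[str]]:
    """Alternate between policy decisions and tool observations."""
    history: List[str] = []
    for _ in range(max_turns):
        name, arg = policy(history)
        if name == "answer":
            return arg, history
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool: {name}"
        # Each observation is appended so the next decision can use it.
        history.append(f"{name}({arg}) -> {observation}")
    return "unknown", history  # turn budget exhausted without an answer


def scripted_policy(history: List[str]) -> Action:
    """Stand-in for the model: zoom, then search, then answer."""
    script: List[Action] = [
        ("zoom_in", "0.4,0.1,0.6,0.3"),
        ("web_search", "text on storefront sign"),
        ("answer", "placeholder city, placeholder country"),
    ]
    return script[min(len(history), len(script) - 1)]


TOOLS = {"zoom_in": zoom_in, "web_search": web_search}
answer, trace = run_agent(scripted_policy, TOOLS)
```

In a real system the scripted policy would be replaced by the multimodal model, which reads the accumulated observations before choosing its next tool call or committing to a final location.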

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates image zoom and web search tools

Uses supervised fine-tuning and reinforcement learning

Leverages hierarchical reward for geographical information
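
One way to picture a hierarchical reward over multi-level geographical information is sketched below. This is an assumption-laden illustration, not the paper's reward: the three levels (country, region, city), the reward values in `LEVEL_REWARDS`, and the string-matching rule are all hypothetical choices made here. The idea it shows is only that a prediction earns more credit the finer the level at which it still agrees with the ground truth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    country: str
    region: str
    city: str


# Hypothetical reward earned when a prediction is correct down to each level.
# These values are placeholders, not numbers taken from the paper.
LEVEL_REWARDS = (("country", 0.2), ("region", 0.5), ("city", 1.0))


def _norm(name: str) -> str:
    """Case- and whitespace-insensitive comparison key."""
    return name.strip().casefold()


def hierarchical_reward(pred: Location, truth: Location) -> float:
    """Return the reward for the deepest level matched from the top down."""
    reward = 0.0
    for level, value in LEVEL_REWARDS:
        if _norm(getattr(pred, level)) != _norm(getattr(truth, level)):
            break  # a coarse-level miss invalidates all finer levels
        reward = value
    return reward


truth = Location("France", "Auvergne-Rhone-Alpes", "Lyon")
exact = hierarchical_reward(Location("france", "Auvergne-Rhone-Alpes", " Lyon "), truth)
country_only = hierarchical_reward(Location("France", "Occitanie", "Toulouse"), truth)
miss = hierarchical_reward(Location("Spain", "Catalonia", "Barcelona"), truth)
```

Stopping at the first mismatch means a matching city name in the wrong country earns nothing, which keeps the signal consistent across levels while still giving partial credit for coarse but correct guesses.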

🔎 Similar Papers

GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model