Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of systematic evaluation of Vision-Language Models (VLMs) on open-world image geolocation. It introduces EarthWhere, a VLM-specific, multi-scale geolocation benchmark covering both country- and street-level real-world scenes. To rigorously assess reasoning, the authors propose a fine-grained reasoning-chain evaluation framework built on human-verified key visual clues and a Shapley-reweighted thinking score, enabling quantitative analysis of regional bias and reasoning efficacy. Evaluation integrates visual recognition, multi-step reasoning, and optional web search, with dual metrics: coordinate accuracy within k km (Acc@k) and hierarchical text-path scoring. Experiments across 13 state-of-the-art VLMs show that Gemini-2.5-Pro achieves the highest average accuracy (56.32%), while the strongest open-weight model, GLM-4.5V, reaches 34.71%. Notably, when visual cues are insufficient, retrieval augmentation and complex reasoning do not consistently improve performance, highlighting persistent limitations in current VLM geolocation capabilities.
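The hierarchical text-path scoring mentioned above is not spelled out in this summary. The sketch below shows one plausible way such a score could work, giving partial credit for each correctly matched level of an administrative path; the function name, level weights, and matching rule are illustrative assumptions, not the benchmark's released evaluation code.

```python
# Illustrative sketch of a hierarchical text-path score for geolocation answers.
# Assumption: prediction and ground truth are ordered administrative paths
# (country -> region -> city -> street) and credit accumulates level by level
# until the first mismatch. The level weights are made up for illustration.

def hierarchical_path_score(pred_path, gold_path, level_weights=(0.4, 0.3, 0.2, 0.1)):
    """Return a score in [0, 1]: partial credit for each matched prefix level."""
    score = 0.0
    for pred, gold, weight in zip(pred_path, gold_path, level_weights):
        if pred.strip().lower() == gold.strip().lower():
            score += weight
        else:
            break  # deeper levels cannot be trusted after a wrong level
    return score


if __name__ == "__main__":
    gold = ["France", "Île-de-France", "Paris", "Rue de Rivoli"]
    pred = ["France", "Île-de-France", "Versailles", "Avenue de Paris"]
    print(f"{hierarchical_path_score(pred, gold):.2f}")  # 0.70: country and region matched
```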

📝 Abstract
Vision-language models (VLMs) have advanced rapidly, yet their capacity for image-grounded geolocation in open-world conditions, a task that is both challenging and in demand in real life, has not been comprehensively evaluated. We present EarthWhere, a comprehensive benchmark for VLM image geolocation that evaluates visual recognition, step-by-step reasoning, and evidence use. EarthWhere comprises 810 globally distributed images across two complementary geolocation scales: WhereCountry (500 multiple-choice questions with country-level answers and panoramas) and WhereStreet (310 fine-grained street-level identification tasks requiring multi-step reasoning with optional web search). For evaluation, we adopt final-prediction metrics: location accuracy within k km (Acc@k) for coordinates and hierarchical path scores for textual localization. Beyond this, we propose to explicitly score intermediate reasoning chains using human-verified key visual clues and a Shapley-reweighted thinking score that attributes credit to each clue's marginal contribution. We benchmark 13 state-of-the-art VLMs with web search tools on EarthWhere and report several types of final-answer accuracy as well as calibrated model thinking scores. Overall, Gemini-2.5-Pro achieves the best average accuracy at 56.32%, while the strongest open-weight model, GLM-4.5V, reaches 34.71%. We reveal that web search and reasoning do not guarantee improved performance when visual clues are limited, and that models exhibit regional biases, achieving up to 42.7% higher scores in some areas than in others. These findings highlight not only the promise but also the persistent challenges models face in mitigating bias and achieving robust, fine-grained localization. We open-source our benchmark at https://github.com/UCSC-VLAA/EarthWhere.
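The abstract defines Acc@k as location accuracy within k km of the ground-truth coordinate. A minimal sketch of that metric, using the haversine great-circle distance, is shown below; the thresholds, coordinates, and function names are assumptions for illustration rather than the benchmark's evaluation code.

```python
# Illustrative sketch of the Acc@k coordinate metric: the fraction of predictions
# whose great-circle (haversine) distance to the ground-truth coordinate is
# within k kilometres. Thresholds and names are illustrative assumptions.
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def acc_at_k(predictions, ground_truths, k_km):
    """Fraction of predicted coordinates that land within k_km of the ground truth."""
    hits = sum(
        haversine_km(p_lat, p_lon, g_lat, g_lon) <= k_km
        for (p_lat, p_lon), (g_lat, g_lon) in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)

if __name__ == "__main__":
    preds = [(48.8584, 2.2945), (40.7580, -73.9855)]
    golds = [(48.8606, 2.3376), (34.0522, -118.2437)]
    for k in (1, 25, 200):  # example thresholds; the paper may use different k values
        print(f"Acc@{k} km = {acc_at_k(preds, golds, k):.2f}")
```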
Problem

Research questions and friction points this paper is trying to address.

Evaluating vision-language models' geolocation skills across global and street scales
Assessing model reasoning using visual clues and Shapley-reweighted thinking scores
Revealing performance gaps and regional biases in fine-grained image localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

EarthWhere benchmark evaluates VLM geolocation skills
Uses human-verified visual clues and Shapley-reweighted thinking scores (a sketch of this scoring idea follows this list)
Tests models across country-level and street-level localization tasks
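The Shapley-reweighted thinking score attributes credit to each human-verified clue by its marginal contribution to localization. Its exact formulation is not given here; the following is a minimal sketch under the assumption that a small clue set admits exact Shapley computation by subset enumeration and that clue usage in the reasoning chain is detected by simple keyword matching. The value table, example clues, and all function names are hypothetical.

```python
# Illustrative sketch of a Shapley-reweighted thinking score.
# Assumptions: a small set of human-verified key clues, a value function v(S)
# giving the localization utility achievable from clue subset S (here a lookup
# table), and a naive keyword check for whether the model's reasoning chain
# mentioned each clue. None of this is the paper's released code.
from itertools import combinations
from math import factorial

def shapley_values(clues, value_fn):
    """Exact Shapley value of each clue by enumerating all subsets (fine for small clue sets)."""
    n = len(clues)
    phi = {c: 0.0 for c in clues}
    for clue in clues:
        others = [c for c in clues if c != clue]
        for r in range(len(others) + 1):
            for subset in combinations(others, r):
                s = set(subset)
                weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                phi[clue] += weight * (value_fn(s | {clue}) - value_fn(s))
    return phi

def thinking_score(reasoning_text, clues, value_fn):
    """Credit the chain for each clue it mentions, weighted by that clue's Shapley value."""
    phi = shapley_values(clues, value_fn)
    total = sum(phi.values()) or 1.0
    used = {c for c in clues if c.lower() in reasoning_text.lower()}
    return sum(phi[c] for c in used) / total

if __name__ == "__main__":
    clues = ["left-hand traffic", "kanji signage", "vending machines"]
    # Toy value table: localization utility achievable from each clue subset.
    table = {
        frozenset(): 0.0,
        frozenset({"left-hand traffic"}): 0.2,
        frozenset({"kanji signage"}): 0.6,
        frozenset({"vending machines"}): 0.3,
        frozenset({"left-hand traffic", "kanji signage"}): 0.8,
        frozenset({"left-hand traffic", "vending machines"}): 0.4,
        frozenset({"kanji signage", "vending machines"}): 0.7,
        frozenset({"left-hand traffic", "kanji signage", "vending machines"}): 1.0,
    }

    def value_fn(s):
        return table[frozenset(s)]

    chain = "The kanji signage and left-hand traffic point to Japan, likely a Tokyo side street."
    print(f"thinking score = {thinking_score(chain, clues, value_fn):.2f}")
```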