🤖 AI Summary
Multimodal large language models (MLLMs) exhibit significant limitations in stepwise geographic reasoning. Method: The paper introduces GeoChain, a large-scale multimodal chain-of-thought (CoT) benchmark designed for geographic reasoning. GeoChain comprises 1.46 million Mapillary street-level images, each paired with a 21-step fine-grained CoT question-answer sequence (over 30 million Q&A pairs in total), covering four reasoning categories: visual, spatial, cultural, and precise geolocation. Questions are annotated by difficulty, and each image is enriched with semantic segmentation (150 classes) and a visual locatability score. Contribution/Results: Benchmarking contemporary MLLMs (GPT-4.1 variants, Claude 3.7, Gemini 2.5 variants) on a diverse 2,088-image subset reveals consistent weaknesses in visual grounding, reasoning coherence, and accurate localization, particularly as reasoning complexity increases. GeoChain thus serves as a scalable diagnostic benchmark for geographic reasoning in MLLMs.
📝 Abstract
This paper introduces GeoChain, a large-scale benchmark for evaluating step-by-step geographic reasoning in multimodal large language models (MLLMs). Leveraging 1.46 million Mapillary street-level images, GeoChain pairs each image with a 21-step chain-of-thought (CoT) question sequence (over 30 million Q&A pairs). These sequences guide models from coarse attributes to fine-grained localization across four reasoning categories (visual, spatial, cultural, and precise geolocation), with questions annotated by difficulty. Images are also enriched with semantic segmentation (150 classes) and a visual locatability score. Our benchmarking of contemporary MLLMs (GPT-4.1 variants, Claude 3.7, Gemini 2.5 variants) on a diverse 2,088-image subset reveals consistent challenges: models frequently show weak visual grounding, reason erratically, and struggle to localize accurately, especially as reasoning complexity increases. GeoChain offers a robust diagnostic methodology for advancing complex geographic reasoning in MLLMs.
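To make the scale described above concrete, the following is a minimal sketch of how one benchmark record might be laid out and how the "over 30 million" figure follows from 1.46 million images with 21 steps each. The class names, field names, and difficulty labels are hypothetical and are not taken from the GeoChain release; only the counts and the four category names come from the abstract.

```python
from dataclasses import dataclass, field
from typing import List

# Values stated in the abstract.
CATEGORIES = ("visual", "spatial", "cultural", "precise geolocation")
STEPS_PER_IMAGE = 21
NUM_IMAGES = 1_460_000  # "1.46 million"; the exact count is not given here


@dataclass
class Step:
    """One question-answer step in a chain (hypothetical layout)."""
    index: int       # position in the chain, 1..21
    question: str
    answer: str
    category: str    # expected to be one of CATEGORIES
    difficulty: str  # label scheme is an assumption, e.g. "easy"


@dataclass
class Record:
    """One image with its reasoning chain (hypothetical layout)."""
    image_id: str
    locatability: float  # visual locatability score; its range is not assumed
    steps: List[Step] = field(default_factory=list)

    def is_complete(self) -> bool:
        # A full chain has exactly 21 steps with known categories.
        return len(self.steps) == STEPS_PER_IMAGE and all(
            s.category in CATEGORIES for s in self.steps
        )


def total_qa_pairs(num_images: int, steps_per_image: int = STEPS_PER_IMAGE) -> int:
    """Total Q&A pairs when every image carries a full chain."""
    return num_images * steps_per_image


print(total_qa_pairs(NUM_IMAGES))  # 30660000
```

With these numbers, 1,460,000 images times 21 steps gives 30,660,000 pairs, consistent with the "over 30 million" stated in the abstract.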