🤖 AI Summary
Existing benchmarks inadequately evaluate foundation models' geospatial reasoning: they are narrow in scope, lack multimodal tasks, and omit core geographic competencies. Method: We introduce MapEval, a multimodal benchmark for map-based reasoning comprising 700 multiple-choice questions about locations across 180 cities in 54 countries, spanning three task types: textual reasoning, API-based querying, and visual map interpretation. The tasks cover spatial relationships, map infographics, travel planning, and navigation, and require collecting world information via real map tools, processing heterogeneous geo-spatial context (named entities, travel distances, user reviews and ratings, images), and performing compositional reasoning over it. Contribution/Results: A comprehensive evaluation of 28 state-of-the-art models shows Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro as the strongest overall, with Claude-3.5-Sonnet-based agents outperforming those built on GPT-4o and Gemini-1.5-Pro by 16% and 21%, respectively. Still, every model trails humans by more than 20% on average, with sub-40% accuracy on map-image interpretation, exposing critical bottlenecks in geospatial intelligence.
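To make the item format above concrete, here is a minimal sketch of how one such multiple-choice item with heterogeneous geo-context could be represented in Python; the class and field names (`MapEvalItem`, `GeoContext`, etc.) are hypothetical assumptions, not the benchmark's released data format.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical schema for a single MapEval-style item. Class and field
# names are illustrative assumptions, not the paper's released format.
@dataclass
class GeoContext:
    """Heterogeneous geo-spatial context gathered via map tools."""
    pois: list[str] = field(default_factory=list)        # named entities / places
    distances_km: dict[tuple[str, str], float] = field(  # pairwise travel distances
        default_factory=dict
    )
    reviews: list[str] = field(default_factory=list)     # user reviews or ratings
    map_image_path: str | None = None                    # map snapshot for visual tasks

@dataclass
class MapEvalItem:
    task_type: str        # "textual" | "api" | "visual"
    question: str         # e.g. a travel-planning or navigation query
    options: list[str]    # multiple-choice candidates
    answer_index: int     # index of the correct option
    context: GeoContext   # supporting geo-data the model must reason over
```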
📝 Abstract
Recent advancements in foundation models have enhanced AI systems' capabilities in autonomous tool usage and reasoning. However, their ability in location- or map-based reasoning, which improves daily life by optimizing navigation, facilitating resource discovery, and streamlining logistics, has not been systematically studied. To bridge this gap, we introduce MapEval, a benchmark designed to assess diverse and complex map-based user queries with geo-spatial reasoning. MapEval features three task types (textual, API-based, and visual) that require collecting world information via map tools, processing heterogeneous geo-spatial contexts (e.g., named entities, travel distances, user reviews or ratings, images), and performing compositional reasoning, all of which state-of-the-art foundation models find challenging. Comprising 700 unique multiple-choice questions about locations across 180 cities and 54 countries, MapEval evaluates foundation models' ability to handle spatial relationships, map infographics, travel planning, and navigation challenges. Using MapEval, we conducted a comprehensive evaluation of 28 prominent foundation models. While no single model excelled across all tasks, Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro achieved competitive performance overall. However, substantial performance gaps emerged, particularly in the agentic tasks, where agents with Claude-3.5-Sonnet outperformed those with GPT-4o and Gemini-1.5-Pro by 16% and 21%, respectively, and the gaps widened further against open-source LLMs. Our detailed analyses provide insights into the strengths and weaknesses of current models; all models still fall short of human performance by more than 20% on average, struggling with complex map images and rigorous geo-spatial reasoning. This gap highlights MapEval's critical role in advancing general-purpose foundation models with stronger geo-spatial understanding.
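Since all results above are multiple-choice accuracies, the following is a minimal sketch of how such scores (and the reported human-model gap) could be computed, assuming items are `(question, options, gold_index)` tuples and a `predict` callable that returns a chosen option index; both are hypothetical stand-ins, not MapEval's official evaluation harness.

```python
from __future__ import annotations

from collections.abc import Callable

# Minimal multiple-choice scoring sketch; item tuples and `predict`
# are hypothetical stand-ins, not MapEval's official harness.
def accuracy(
    items: list[tuple[str, list[str], int]],          # (question, options, gold index)
    predict: Callable[[str, list[str]], int],         # returns chosen option index
) -> float:
    """Fraction of items where the predicted option matches the gold index."""
    correct = sum(predict(q, opts) == gold for q, opts, gold in items)
    return correct / len(items)

# A human-model gap of more than 20 points, as reported above, means:
#   accuracy(items, human_answers) - accuracy(items, model_predict) > 0.20
```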