FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large vision-language models (LVLMs) treat maps merely as charts, failing to model hierarchical cartographic symbol systems (symbols, geometry, and text) and multidimensional spatial relations (topological, metric, and directional). Crucially, they lack cross-map, multi-step spatial reasoning capabilities. To address this, we introduce FRIEDA, the first benchmark for cartographic intelligence. Built on GIS taxonomy, it comprises a real-world, multi-source map visual question answering (VQA) dataset. We systematically define and evaluate two core tasks, cross-map association and multi-step spatial reasoning, covering three types of spatial relations, and propose dual evaluation modes: direct and contextual. Across 11 state-of-the-art LVLMs, the top-performing models, Gemini-2.5-Pro and GPT-5-Think, achieve only 38.20% and 37.20% accuracy, respectively, far below human performance (84.87%). These results reveal a fundamental gap in spatial intelligence; FRIEDA bridges the long-standing assessment divide between chart understanding and geospatial reasoning.

📝 Abstract
Cartographic reasoning is the skill of interpreting geographic relationships by aligning legends, map scales, compass directions, map texts, and geometries across one or more map images. Although essential both as a concrete cognitive capability and for critical tasks such as disaster response and urban planning, it remains largely unevaluated. Building on progress in chart and infographic understanding, recent large vision-language model studies on map visual question answering (VQA) often treat maps as a special case of charts. In contrast, map VQA demands comprehension of layered symbology (e.g., symbols, geometries, and text labels) as well as spatial relations tied to orientation and distance that often span multiple maps and are not captured by chart-style evaluations. To address this gap, we introduce FRIEDA, a benchmark for testing complex open-ended cartographic reasoning in LVLMs. FRIEDA sources real map images from documents and reports across diverse domains and geographic areas. Following classifications in the Geographic Information System (GIS) literature, FRIEDA targets all three categories of spatial relations: topological (border, equal, intersect, within), metric (distance), and directional (orientation). All questions require multi-step inference, and many require cross-map grounding and reasoning. We evaluate eleven state-of-the-art LVLMs under two settings: (1) the direct setting, where we provide the maps relevant to the question, and (2) the contextual setting, where the model may have to identify the maps relevant to the question before reasoning. Even the strongest models, Gemini-2.5-Pro and GPT-5-Think, achieve only 38.20% and 37.20% accuracy, respectively, far below human performance of 84.87%. These results reveal a persistent gap in multi-step cartographic reasoning, positioning FRIEDA as a rigorous benchmark to drive progress on spatial intelligence in LVLMs.
Problem

Research questions and friction points this paper is trying to address.

Evaluates multi-step cartographic reasoning in vision-language models
Tests understanding of layered map symbology and spatial relations
Assesses cross-map grounding and inference across diverse geographic data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces FRIEDA benchmark for multi-step cartographic reasoning
Tests spatial relations: topological, metric, and directional across maps
Evaluates LVLMs in direct and contextual map reasoning settings
Jiyoon Pyo
University of Minnesota-Twin Cities

Yuankun Jiao
University of Minnesota-Twin Cities

Dongwon Jung
Ph.D. Student, UC Davis
Natural Language Processing

Zekun Li
University of Minnesota-Twin Cities

Leeje Jang
University of Minnesota-Twin Cities

Sofia Kirsanova
University of Minnesota-Twin Cities

Jina Kim
University of Minnesota-Twin Cities

Yijun Lin
University of Minnesota, Twin Cities
Spatiotemporal Prediction, Machine Learning

Qin Liu
University of California, Davis

Junyi Xie
University of Minnesota-Twin Cities

Hadi Askari
UC Davis
Machine Learning, Explainable AI, NLP, Computer Vision, Computational Social Science

Nan Xu
Google

Muhao Chen
Assistant Professor of Computer Science, University of California, Davis
Natural Language Processing, Robust ML, AI Safety, Vision-language Models

Yao-Yi Chiang
Associate Professor, Computer Science & Engineering, University of Minnesota
Spatial AI, Data Mining, Machine Learning, Geographic Information Science, Computer Vision