🤖 AI Summary
Research on visual question answering with maps (Map-VQA) for multimodal large language models (MLLMs) has focused almost exclusively on choropleth maps, with no systematic evaluation across other map types (such as cartograms and proportional symbol maps) or thematic domains like housing and crime. To address this gap, we introduce MapIQ, a benchmark for map question answering covering three map types, six thematic domains, and six visual analytical tasks, augmented by controlled experiments on cartographic design variables. Our contributions include: (i) substantially broader coverage of thematic map types and domains than prior Map-VQA benchmarks; (ii) a human performance baseline and a robustness evaluation under map-design changes; and (iii) empirical evidence of MLLMs' reliance on internal geographic knowledge and their sensitivity to design elements, including color schemes and legends. Together, these results highlight the need for evaluation frameworks tailored to map understanding.
📝 Abstract
Recent advances in multimodal large language models (MLLMs) have driven researchers to explore how well these models read data visualizations, e.g., bar charts and scatter plots. More recently, attention has shifted to visual question answering with maps (Map-VQA). However, Map-VQA research has primarily focused on choropleth maps, covering only a limited range of thematic categories and visual analytical tasks. To address these gaps, we introduce MapIQ, a benchmark dataset comprising 14,706 question-answer pairs across three map types (choropleth maps, cartograms, and proportional symbol maps), spanning topics from six distinct themes (e.g., housing, crime). We evaluate multiple MLLMs on six visual analytical tasks, comparing their performance against one another and against a human baseline. An additional experiment examining the impact of map-design changes (e.g., altered color schemes, modified legend designs, and removal of map elements) provides insights into the robustness and sensitivity of MLLMs, their reliance on internal geographic knowledge, and potential avenues for improving Map-VQA performance.