🤖 AI Summary
Existing chart benchmarks predominantly target high-level reasoning tasks and fail to expose fine-grained deficiencies of large multimodal models (LMMs) in fundamental visual perception. To address this, we propose a multi-granularity evaluation framework grounded in the theory of graphical perception, systematically diagnosing LMM perceptual limitations at the chart, visual-element, and pixel levels. Our method incorporates principles of human visual perception into LMM assessment, enabling automated task synthesis and response parsing and yielding a controllable, cross-chart-type, and reproducible benchmark. Experimental results reveal fundamental weaknesses in state-of-the-art models (e.g., GPT-4o) across three critical dimensions: generalization across chart types, recognition of basic visual elements (e.g., bars, lines, markers), and intra-chart numerical cross-referencing. The framework and annotated dataset are publicly released to foster reproducible research.
📝 Abstract
Despite the promising results of large multimodal models (LMMs) in complex vision-language tasks that require knowledge, reasoning, and perception abilities together, we surprisingly find that these models struggle with simple tasks on infographics that require perception only. Because existing benchmarks primarily focus on end tasks that require various abilities, they provide limited, fine-grained insight into the limitations of the models' perception abilities. To address this gap, we leverage the theory of graphical perception, an approach used to study how humans decode visual information encoded in charts and graphs, to develop an evaluation framework for analyzing gaps in LMMs' perception abilities in charts. With automated task generation and response evaluation designs, our framework enables comprehensive and controlled testing of LMMs' graphical perception across diverse chart types, visual elements, and task types. We apply our framework to evaluate and diagnose the perception capabilities of state-of-the-art LMMs at three granularity levels (chart, visual element, and pixel). Our findings underscore several critical limitations of current state-of-the-art LMMs, including GPT-4o: their inability to (1) generalize across chart types, (2) understand fundamental visual elements, and (3) cross-reference values within a chart. These insights provide guidance for future improvements in the perception abilities of LMMs. The evaluation framework and labeled data are publicly available at https://github.com/microsoft/lmm-graphical-perception.