Towards Understanding Graphical Perception in Large Multimodal Models

📅 2025-03-13
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing chart benchmarks predominantly target high-level reasoning tasks and fail to expose fine-grained deficiencies of large multimodal models (LMMs) in fundamental visual perception. To address this, the paper proposes a multi-granularity evaluation framework grounded in graphical perception theory, systematically diagnosing LMM perceptual limitations at the chart, visual-element, and pixel levels. The method incorporates principles of human visual perception into LMM assessment, enabling automated task synthesis and response parsing and yielding a controllable, reproducible benchmark that spans chart types. Experiments reveal fundamental weaknesses in state-of-the-art models such as GPT-4o across three critical dimensions: generalization across chart types, recognition of basic visual elements (e.g., bars, lines, markers), and cross-referencing of values within a chart. The framework and annotated dataset are publicly released to foster reproducible research.

📝 Abstract
Despite the promising results of large multimodal models (LMMs) in complex vision-language tasks that require knowledge, reasoning, and perception abilities together, we were surprised to find that these models struggle with simple tasks on infographics that require perception only. As existing benchmarks primarily focus on end tasks that require various abilities, they provide limited, fine-grained insights into the limitations of the models' perception abilities. To address this gap, we leverage the theory of graphical perception, an approach used to study how humans decode visual information encoded on charts and graphs, to develop an evaluation framework for analyzing gaps in LMMs' perception abilities in charts. With automated task generation and response evaluation designs, our framework enables comprehensive and controlled testing of LMMs' graphical perception across diverse chart types, visual elements, and task types. We apply our framework to evaluate and diagnose the perception capabilities of state-of-the-art LMMs at three granularity levels (chart, visual element, and pixel). Our findings underscore several critical limitations of current state-of-the-art LMMs, including GPT-4o: their inability to (1) generalize across chart types, (2) understand fundamental visual elements, and (3) cross-reference values within a chart. These insights provide guidance for future improvements in perception abilities of LMMs. The evaluation framework and labeled data are publicly available at https://github.com/microsoft/lmm-graphical-perception.
Problem

Research questions and friction points this paper is trying to address.

LMMs struggle with simple infographic perception tasks.
Existing benchmarks lack fine-grained insights into LMMs' perception limitations.
How to evaluate LMMs' graphical perception in a controlled way across diverse chart types remains open.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages graphical perception theory for LMM evaluation.
Automated task generation and response evaluation for controlled testing.
Diagnoses LMM perception at three granularity levels (chart, visual element, and pixel).
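The automated task-generation and response-evaluation loop the framework describes could look roughly like the minimal sketch below. All names and the scoring rule here are hypothetical illustrations, not the paper's actual implementation: the real framework renders charts and queries LMMs, while this toy version only synthesizes a task specification with a known ground truth and scores a free-text answer.

```python
import random
import re

def make_ratio_task(seed=0):
    """Synthesize a 'what fraction of bar A is bar B?' task with known truth.

    Hypothetical stand-in for automated task generation: real frameworks
    would render the spec as an actual chart image for the LMM.
    """
    rng = random.Random(seed)
    bars = {label: rng.randint(10, 100) for label in "ABCD"}
    a, b = rng.sample(sorted(bars), 2)
    question = f"In the bar chart, what fraction of bar {a}'s height is bar {b}?"
    return {"spec": bars, "question": question, "truth": bars[b] / bars[a]}

def score_response(response_text, truth, tol=0.1):
    """Parse the first number in a model's free-text answer and score it
    within a relative tolerance (automated response evaluation)."""
    match = re.search(r"[-+]?\d*\.?\d+", response_text)
    if match is None:
        return False
    return abs(float(match.group()) - truth) <= tol * abs(truth)

task = make_ratio_task(seed=42)
print(task["question"])
print(score_response("Roughly 0.5", task["truth"]))
```

Scoring against a relative tolerance rather than exact string match is one plausible way to credit approximately correct perceptual estimates while still distinguishing them from failures.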
Kai Zhang
The Ohio State University
Jianwei Yang
Research Scientist, Meta SuperIntelligence Lab
Multimodal Agentic AI
J. Inala
Microsoft Research
Chandan Singh
Senior Researcher, Microsoft Research
🔍 Interpretability · 🤖 Foundation models · 🧠 Neuroscience · 🌳 Transparent models · 💊 Healthcare
Jianfeng Gao
Microsoft Research
Yu Su
Microsoft Research
Chenglong Wang
Microsoft Research