CHART-6: Human-Centered Evaluation of Data Visualization Understanding in Vision-Language Models

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) lack rigorous, human-centered evaluation of their data visualization literacy. Method: This study introduces a standardized, human-oriented assessment framework to systematically evaluate eight VLMs across six chart comprehension tasks, benchmarking their zero-shot performance against human responses via expert annotation, error-pattern analysis, correlation testing, and multi-dimensional behavioral comparison. Contribution/Results: All models perform significantly worse than humans, even under relaxed scoring criteria, show only weak behavioral correlation with human participants, and produce systematically distinct error distributions unrelated to known human cognitive biases. The work reveals fundamental limitations in VLMs' visualization understanding and integrates human cognitive assessment paradigms into AI evaluation, establishing a foundation for cognitively grounded modeling and interpretability research in vision-language understanding.
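
The correlation-testing step mentioned above can be illustrated with a minimal sketch of item-level agreement between one model and human participants. The per-item accuracy values and the choice of a Spearman rank correlation here are assumptions for illustration, not the paper's exact analysis pipeline.

```python
# Minimal sketch: item-level agreement between a model and human participants.
# The accuracy arrays are hypothetical placeholders, not the paper's data.
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-item accuracies (fraction correct) on a shared assessment.
human_acc = np.array([0.92, 0.85, 0.40, 0.73, 0.61, 0.95, 0.33, 0.78])
model_acc = np.array([0.80, 0.55, 0.35, 0.60, 0.50, 0.90, 0.45, 0.40])

# A rank correlation asks whether items hard for humans are also hard for the model.
rho, p_value = spearmanr(human_acc, model_acc)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```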

📝 Abstract
Data visualizations are powerful tools for communicating patterns in quantitative data. Yet understanding any data visualization is no small feat -- succeeding requires jointly making sense of visual, numerical, and linguistic inputs arranged in a conventionalized format one has previously learned to parse. Recently developed vision-language models are, in principle, promising candidates for developing computational models of these cognitive operations. However, it is currently unclear to what degree these models emulate human behavior on tasks that involve reasoning about data visualizations. This gap reflects limitations in prior work that has evaluated data visualization understanding in artificial systems using measures that differ from those typically used to assess these abilities in humans. Here we evaluated eight vision-language models on six data visualization literacy assessments designed for humans and compared model responses to those of human participants. We found that these models performed worse than human participants on average, and this performance gap persisted even when using relatively lenient criteria to assess model performance. Moreover, while relative performance across items was somewhat correlated between models and humans, all models produced patterns of errors that were reliably distinct from those produced by human participants. Taken together, these findings suggest significant opportunities for further development of artificial systems that might serve as useful models of how humans reason about data visualizations. All code and data needed to reproduce these results are available at: https://osf.io/e25mu/?view_only=399daff5a14d4b16b09473cf19043f18.
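
The "relatively lenient criteria" mentioned in the abstract can be approximated with a small sketch contrasting strict and lenient scoring of free-form model answers. The normalization rules and the example response below are hypothetical assumptions; the paper's actual criteria may differ.

```python
# Sketch of strict vs. lenient scoring of a model's free-form answer against
# a multiple-choice ground truth. Example strings are illustrative only.
import re

def normalize(s: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences don't matter.
    return re.sub(r"\s+", " ", s.strip().lower())

def strict_score(response: str, answer: str) -> bool:
    # Exact string match after trimming whitespace.
    return response.strip() == answer.strip()

def lenient_score(response: str, answer: str) -> bool:
    # Accept the correct option appearing anywhere inside a longer model response.
    return normalize(answer) in normalize(response)

response = "The answer is approximately 40%."
answer = "40%"
print(strict_score(response, answer))   # False
print(lenient_score(response, answer))  # True
```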
Problem

Research questions and friction points this paper is trying to address.

Evaluating how well vision-language models understand data visualizations relative to humans
Assessing model performance on visualization literacy tasks designed for humans
Identifying gaps between human and model reasoning about visualizations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated eight vision-language models on six visualization literacy assessments designed for humans
Systematically compared model responses with those of human participants
Identified persistent performance gaps and reliably distinct error patterns (a minimal comparison sketch follows this list)
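
As a rough illustration of how model and human error patterns might be compared, the sketch below tests whether the two groups distribute their incorrect answers differently across an item's distractor options. The counts and the use of a chi-square test of independence are assumptions for illustration, not the authors' reported analysis.

```python
# Sketch: do a model's wrong answers fall on different distractors than humans'?
# Counts are hypothetical; the paper's error-pattern analysis may differ.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: human participants vs. one model; columns: how often each of three
# incorrect options was chosen for a single hypothetical item.
error_counts = np.array([
    [34, 12,  4],   # humans
    [ 5, 30, 15],   # model
])

chi2, p_value, dof, _ = chi2_contingency(error_counts)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.4f}")
```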