ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large vision-language models (LVLMs) exhibit significant weaknesses in complex chart understanding, particularly in geometric reasoning, trend identification, and visual counting. Method: The paper introduces ChartMuseum, a Chart Question Answering benchmark specifically built to evaluate complex visual and textual reasoning, comprising 1,162 expert-annotated questions curated from real-world charts across 184 sources and spanning diverse reasoning types. Its difficulty is carefully calibrated: human accuracy reaches 93%, while state-of-the-art LVLMs degrade sharply (Gemini-2.5-Pro: 63.0%; Qwen2.5-VL-72B-Instruct: 38.5%), and all models score 35-55 percentage points lower on visual-reasoning-heavy questions than on text-reasoning-heavy ones. Contribution/Results: Through a case study on a synthetic dataset solvable only through visual reasoning, together with a qualitative error categorization, the paper systematically diagnoses LVLMs' visual reasoning deficiencies. ChartMuseum thus provides a discriminative, unsaturated evaluation to guide model diagnostics, data curation, and architectural improvements for chart understanding.

📝 Abstract
Chart understanding presents a unique challenge for large vision-language models (LVLMs), as it requires the integration of sophisticated textual and visual reasoning capabilities. However, current LVLMs exhibit a notable imbalance between these skills, falling short on visual reasoning that is difficult to perform in text. We conduct a case study using a synthetic dataset solvable only through visual reasoning and show that model performance degrades significantly with increasing visual complexity, while human performance remains robust. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions spanning multiple reasoning types, curated from real-world charts across 184 sources, specifically built to evaluate complex visual and textual reasoning. Unlike prior chart understanding benchmarks -- where frontier models perform similarly and near saturation -- our benchmark exposes a substantial gap between model and human performance, while effectively differentiating model capabilities: although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct achieves only 38.5%. Moreover, on questions requiring primarily visual reasoning, all models experience a 35%-55% performance drop from text-reasoning-heavy question performance. Lastly, our qualitative error analysis reveals specific categories of visual reasoning that are challenging for current LVLMs.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LVLMs' visual-text reasoning imbalance in chart understanding
Assessing performance drop with increasing visual complexity in charts
Identifying challenging visual reasoning categories for current LVLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic dataset for visual reasoning evaluation
ChartMuseum benchmark with expert-annotated questions
Performance gap analysis across models, humans, and reasoning types (a minimal sketch follows)
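The headline numbers come down to a straightforward accuracy comparison over question subsets: score each model answer, group by annotated reasoning type, and report the gap between text-reasoning-heavy and visual-reasoning-heavy questions. Below is a minimal sketch of that analysis; the record fields (`reasoning_type`, `prediction`, `answer`) and the exact-match scoring are illustrative assumptions, not the paper's released evaluation code.

```python
from collections import defaultdict

# Hypothetical per-question records; field names are illustrative,
# not the benchmark's actual schema.
records = [
    {"reasoning_type": "text",   "prediction": "42%", "answer": "42%"},
    {"reasoning_type": "visual", "prediction": "Q3",  "answer": "Q2"},
    {"reasoning_type": "visual", "prediction": "3",   "answer": "3"},
]

def accuracy_by_type(records):
    """Exact-match accuracy per reasoning type (a simple stand-in
    for whatever answer matching the benchmark actually uses)."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        t = r["reasoning_type"]
        total[t] += 1
        correct[t] += r["prediction"].strip().lower() == r["answer"].strip().lower()
    return {t: correct[t] / total[t] for t in total}

acc = accuracy_by_type(records)
# The paper's reported 35-55 point drop corresponds to this difference
# computed over the real benchmark splits.
gap = acc.get("text", 0.0) - acc.get("visual", 0.0)
print(acc, f"text-vs-visual gap: {gap:.1%}")
```

On the real benchmark this difference, computed per model, is what separates text-reasoning performance from visual-reasoning performance by 35-55 points.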
Authors
Liyan Tang
The University of Texas at Austin
Grace Kim
The University of Texas at Austin
Xinyu Zhao
The University of North Carolina at Chapel Hill
Thom Lake
The University of Texas at Austin
Wenxuan Ding
New York University
Natural Language Processing
Fangcong Yin
The University of Texas at Austin
Prasann Singhal
The University of Texas at Austin
Manya Wadhwa
PhD Student @ UT Austin
Natural Language Processing
Zeyu Leo Liu
The University of Texas at Austin
Zayne Sprague
Graduate Student at the University of Texas at Austin
Artificial Intelligence · Machine Learning · Deep Learning · Natural Language Understanding
Ramya Namuduri
The University of Texas at Austin
Bodun Hu
The University of Texas at Austin
Juan Diego Rodriguez
UT Austin
NLP · machine learning
Greg Durrett
Associate Professor of Computer Science, New York University
Natural Language Processing