🤖 AI Summary
Current large vision-language models (LVLMs) exhibit significant weaknesses in complex chart understanding, particularly in geometric reasoning, trend identification, and visual counting. Method: We introduce ChartMuseum, a benchmark explicitly designed to test complex visual reasoning in chart question answering, comprising 1,162 expert-annotated questions spanning diverse reasoning types, curated from real-world charts drawn from 184 sources. Its difficulty is carefully calibrated: human accuracy reaches 93%, while state-of-the-art LVLMs degrade severely (e.g., Gemini-2.5-Pro: 63.0%; Qwen2.5-VL-72B-Instruct: 38.5%), and on questions requiring primarily visual reasoning, models perform 35%-55% worse than on text-reasoning-heavy questions. Contribution/Results: Through a case study on a synthetic dataset solvable only by visual reasoning, together with a qualitative error categorization, we systematically diagnose LVLMs' visual reasoning deficiencies. ChartMuseum provides a reproducible, highly discriminative evaluation framework to guide model diagnostics, data curation, and architectural improvements for chart understanding.
📝 Abstract
Chart understanding presents a unique challenge for large vision-language models (LVLMs), as it requires the integration of sophisticated textual and visual reasoning capabilities. However, current LVLMs exhibit a notable imbalance between these skills, falling short on visual reasoning that is difficult to perform in text. We conduct a case study using a synthetic dataset solvable only through visual reasoning and show that model performance degrades significantly with increasing visual complexity, while human performance remains robust. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions spanning multiple reasoning types, curated from real-world charts across 184 sources, specifically built to evaluate complex visual and textual reasoning. Unlike prior chart understanding benchmarks -- where frontier models perform similarly and near saturation -- our benchmark exposes a substantial gap between model and human performance while effectively differentiating model capabilities: although humans achieve 93% accuracy, the best-performing model, Gemini-2.5-Pro, attains only 63.0%, and the leading open-source LVLM, Qwen2.5-VL-72B-Instruct, achieves only 38.5%. Moreover, on questions requiring primarily visual reasoning, all models suffer a 35%-55% drop in performance relative to questions that rely chiefly on textual reasoning. Lastly, our qualitative error analysis reveals specific categories of visual reasoning that are challenging for current LVLMs.