🤖 AI Summary
Current large vision-language models (LVLMs) exhibit significant weaknesses in complex chart understanding, particularly in geometric reasoning, trend identification, and visual counting. Method: We introduce ChartMuseum, a benchmark explicitly designed to test complex visual reasoning in chart question answering, comprising 1,162 expert-annotated questions spanning diverse reasoning types, curated from real-world charts drawn from 184 sources. Its difficulty is carefully calibrated: human accuracy reaches 93%, while state-of-the-art LVLMs degrade severely (e.g., Gemini-2.5-Pro: 63.0%; Qwen2.5-VL-72B-Instruct: 38.5%), and on questions requiring primarily visual reasoning, models perform 35%-55% worse than on text-reasoning-heavy questions. Contribution/Results: Through a case study on a synthetic dataset solvable only by visual reasoning, together with a qualitative error categorization, we systematically diagnose LVLMs' visual reasoning deficiencies. ChartMuseum provides a reproducible, highly discriminative evaluation framework to guide model diagnostics, data curation, and architectural improvements for chart understanding.
📝 Abstract
Chart understanding presents a unique challenge for large vision-language models (LVLMs), as it requires the integration of sophisticated textual and visual reasoning capabilities. However, current LVLMs exhibit a notable imbalance between these skills, falling short on visual reasoning that is difficult to perform in text. We conduct a case study using a synthetic dataset solvable only through visual reasoning and show that model performance degrades significantly with increasing visual complexity, while human performance remains robust. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions spanning multiple reasoning types, curated from real-world charts across 184 sources, specifically built to evaluate complex visual and textual reasoning. Unlike prior chart understanding benchmarks -- where frontier models perform similarly and near saturation -- our benchmark exposes a substantial gap between model and human performance while effectively differentiating model capabilities: although humans achieve 93% accuracy, the best-performing model, Gemini-2.5-Pro, attains only 63.0%, and the leading open-source LVLM, Qwen2.5-VL-72B-Instruct, achieves only 38.5%. Moreover, on questions requiring primarily visual reasoning, all models suffer a 35%-55% drop in performance relative to questions that rely chiefly on textual reasoning. Lastly, our qualitative error analysis reveals specific categories of visual reasoning that are challenging for current LVLMs.