ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering

📅 2025-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing chart question answering (CQA) benchmarks suffer from homogeneity, failing to reflect real-world diversity and leading to performance saturation in large vision-language models (LVLMs). Method: The authors introduce ChartQAPro, a more diverse and challenging CQA benchmark comprising 1,341 charts drawn from 157 real-world sources (including infographics and dashboards) and 1,948 questions. It contributes three key elements: (1) multi-source, heterogeneous charts, (2) non-standard question types (e.g., multiple-choice, conversational, hypothetical, and unanswerable questions), and (3) a fine-grained error analysis, all constructed via expert annotation and cross-domain sampling. Contribution/Results: Evaluation of 21 state-of-the-art LVLMs reveals a substantial accuracy drop relative to ChartQA (e.g., Claude Sonnet 3.5 declines from 90.5% to 55.81%), underscoring the difficulty of realistic chart reasoning and addressing key limitations of prior benchmarks.

📝 Abstract
Charts are ubiquitous, as people often use them to analyze data, answer questions, and discover critical insights. However, performing complex analytical tasks with charts requires significant perceptual and cognitive effort. Chart Question Answering (CQA) systems automate this process by enabling models to interpret and reason with visual representations of data. However, existing benchmarks like ChartQA lack real-world diversity and have recently shown performance saturation with modern large vision-language models (LVLMs). To address these limitations, we introduce ChartQAPro, a new benchmark that includes 1,341 charts from 157 diverse sources, spanning various chart types, including infographics and dashboards, and featuring 1,948 questions of various types, such as multiple-choice, conversational, hypothetical, and unanswerable questions, to better reflect real-world challenges. Our evaluations with 21 models show a substantial performance drop for LVLMs on ChartQAPro; e.g., Claude Sonnet 3.5 scores 90.5% on ChartQA but only 55.81% on ChartQAPro, underscoring the complexity of chart reasoning. We complement our findings with detailed error analyses and ablation studies, identifying key challenges and opportunities for advancing LVLMs in chart understanding and reasoning. We release ChartQAPro at https://github.com/vis-nlp/ChartQAPro.
Problem

Research questions and friction points this paper is trying to address.

Existing CQA benchmarks lack real-world diversity and show performance saturation with modern LVLMs.
ChartQAPro introduces diverse chart sources and more complex question types to restore benchmark headroom.
The resulting performance drop across models highlights open challenges in chart reasoning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces 1,341 diverse charts from 157 real-world sources, including infographics and dashboards.
Includes varied question types (multiple-choice, conversational, hypothetical, unanswerable) to reflect real-world challenges.
Benchmarks 21 LVLMs, revealing a substantial performance drop, with detailed error analyses and ablations.
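The reported accuracies can be made concrete with a small scoring sketch. ChartQA-family benchmarks commonly use a "relaxed accuracy" criterion, where a numeric prediction counts as correct if it falls within 5% of the gold value; how ChartQAPro handles unanswerable questions below is an assumption (credit only for an explicit abstention), and its exact protocol may differ from this sketch — see the official repository.

```python
def relaxed_match(pred: str, gold: str, tol: float = 0.05) -> bool:
    """Exact match for text answers; tolerance-based match for numbers."""
    try:
        p, g = float(pred), float(gold)
        if g == 0:
            return p == 0
        return abs(p - g) / abs(g) <= tol
    except ValueError:
        # Non-numeric answers fall back to case-insensitive string match.
        return pred.strip().lower() == gold.strip().lower()

def score(examples: list[dict]) -> float:
    """Average accuracy over {"pred": ..., "gold": ...} records.

    Assumption: unanswerable questions are marked with gold ==
    "unanswerable" and are correct only when the model abstains.
    """
    correct = 0
    for ex in examples:
        if ex["gold"] == "unanswerable":
            correct += ex["pred"].strip().lower() == "unanswerable"
        else:
            correct += relaxed_match(ex["pred"], ex["gold"])
    return correct / len(examples)
```

Under this criterion, a model answering "102" when the gold value is "100" is counted as correct (2% error), while "110" is not, which is why small numeric-reading errors do not fully explain the 30-point gap the paper reports.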