🤖 AI Summary
Existing multimodal large language models (MLLMs) have not been systematically evaluated on predicting the perceptual and affective impact of charts, so claims about their ability risk overgeneralization. Method: We introduce the first benchmark dataset for evaluating “chart experience impact,” comprising 36 diverse charts annotated via crowdsourcing along seven perceptual and affective dimensions, and supporting both single-chart prediction and chart-pair comparison tasks. We formally define and quantify “experience impact” and propose a novel multidimensional evaluation framework integrating perceptual and affective signals. Contribution/Results: State-of-the-art MLLMs, including LLaVA, Qwen-VL, and Gemini, evaluated via zero-shot and few-shot prompting reach >85% accuracy on chart-pair comparison, approaching human-level performance; single-chart prediction, however, remains markedly weaker, revealing fundamental limitations in deep reasoning about individual charts. This work establishes a new benchmark, task paradigm, and conceptual foundation for intelligent chart understanding.
📝 Abstract
The field of Multimodal Large Language Models (MLLMs) has made remarkable progress in visual understanding tasks, presenting a vast opportunity to predict the perceptual and emotional impact of charts. However, it also raises concerns, as many applications of LLMs are based on overgeneralized assumptions from a few examples, lacking sufficient validation of their performance and effectiveness. We introduce Chart-to-Experience, a benchmark dataset comprising 36 charts, evaluated by crowdsourced workers for their impact on seven experiential factors. Using the dataset as ground truth, we evaluated the capabilities of state-of-the-art MLLMs on two tasks: direct prediction and pairwise comparison of charts. Our findings imply that MLLMs are not as sensitive as human evaluators when assessing individual charts, but are accurate and reliable in pairwise comparisons.
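To make the pairwise-comparison task concrete, below is a minimal Python sketch of how such a zero-shot evaluation might be run. The `query_mllm` callable, the chart-image arguments, and the example factor name "readability" are illustrative assumptions for this sketch, not part of the released benchmark; the paper's actual prompts, factor names, and scoring may differ.

```python
from typing import Callable

def compare_pair(
    query_mllm: Callable[[str, list[str]], str],  # hypothetical: (prompt, image paths) -> model text
    chart_a_path: str,
    chart_b_path: str,
    factor: str = "readability",  # illustrative experiential factor, not from the paper
) -> str:
    """Zero-shot: ask which of two charts has the stronger impact on one factor."""
    prompt = (
        "You are shown two charts, Chart A and Chart B. "
        f"Which chart has a stronger impact on the viewer's {factor}? "
        "Answer with exactly one letter: A or B."
    )
    answer = query_mllm(prompt, [chart_a_path, chart_b_path]).strip().upper()
    return "A" if answer.startswith("A") else "B"

def pairwise_accuracy(model_choices: list[str], human_choices: list[str]) -> float:
    """Agreement between model picks and the crowdsourced majority preference."""
    correct = sum(m == h for m, h in zip(model_choices, human_choices))
    return correct / len(human_choices)
```

Under this framing, the single-chart (direct prediction) task would instead ask the model for a rating on each factor and compare it against the crowdsourced scores, which is where the abstract reports MLLMs falling short of human sensitivity.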