ChartFI: Benchmarking Faithfulness and Insightfulness of Chart Descriptions from Multimodal Large Language Models

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing chart captioning benchmarks are overly simplistic and do not comprehensively evaluate the faithfulness and insightfulness of descriptions generated by multimodal large language models (MLLMs). To address this gap, this work proposes ChartFI-Bench, a benchmark grounded in four dimensions of description quality: factual accuracy, salient feature emphasis, domain-informed guidance, and chart-text complementarity. The benchmark comprises 896 chart-description pairs, combining human-crafted complex visualizations with semantically rich reference descriptions, along with four aligned evaluation metrics: Faithfulness, Coverage, Informativeness, and Acuity. Experiments on mainstream MLLMs show that the framework exposes common weaknesses in generating descriptions that are both factually faithful and insightful.

📝 Abstract

Chart descriptions are essential for accessibility, cross-modal retrieval, and assisting readers in extracting insights from complex visualizations. As multimodal large language models (MLLMs) are increasingly adopted for automated chart description generation, a critical question arises: how faithfully and insightfully do these models actually describe charts? Current benchmarks fall short on two fronts: existing datasets consist of simple, homogeneous charts paired with shallow, fact-enumerating descriptions; and prevailing metrics fail to capture the multi-faceted nature of description quality. To address these gaps, we present the Chart Faithfulness and Insightfulness Benchmark (ChartFI-Bench). We first summarize four dimensions that characterize high-quality chart descriptions: factual accuracy, salient feature emphasis, domain-informed guidance, and chart-text complementarity. Guided by these dimensions, we construct a high-quality benchmark comprising 896 chart-description pairs, which feature visually complex charts and semantically rich descriptions. Furthermore, we design four aligned evaluation metrics -- Faithfulness, Coverage, Informativeness, and Acuity -- to systematically assess the quality of descriptions across these dimensions. Experiments conducted on mainstream MLLMs demonstrate the effectiveness of the proposed framework and reveal common weaknesses among existing models.

Problem

Research questions and friction points this paper is trying to address.

chart description

multimodal large language models

faithfulness

insightfulness

benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chart Description

Multimodal Large Language Models

Faithfulness