ChartEdit: How Far Are MLLMs From Automating Chart Analysis? Evaluating MLLMs' Capability via Chart Editing

📅 2025-05-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Chart editing poses a significant challenge for multimodal large language models (MLLMs), yet existing evaluations rely on ad hoc case studies and lack systematic benchmarks. To address this gap, we introduce ChartEdit, the first high-quality, instruction-driven benchmark specifically designed for chart editing. It comprises 1,405 human-annotated instructions and 233 real-world charts, enabling fine-grained evaluation through a dual-level assessment protocol: code-level correctness and chart-level visual fidelity. Using ChartEdit, we systematically evaluate 10 state-of-the-art MLLMs; the best-performing model achieves only 59.96/100 on average, revealing critical limitations in precise intent understanding and controllable, faithful editing. This work establishes the first reproducible, extensible evaluation infrastructure for chart editing capabilities, filling a fundamental gap in MLLM assessment and providing a foundation for future research.

๐Ÿ“ Abstract
Although multimodal large language models (MLLMs) show promise in generating chart rendering code, chart editing presents a greater challenge. This difficulty stems from its nature as a labor-intensive task for humans that also demands MLLMs to integrate chart understanding, complex reasoning, and precise intent interpretation. While many MLLMs claim such editing capabilities, current assessments typically rely on limited case studies rather than robust evaluation methodologies, highlighting the urgent need for a comprehensive evaluation framework. In this work, we propose ChartEdit, a new high-quality benchmark designed for chart editing tasks. This benchmark comprises 1,405 diverse editing instructions applied to 233 real-world charts, with each instruction-chart instance having been manually annotated and validated for accuracy. Utilizing ChartEdit, we evaluate the performance of 10 mainstream MLLMs across two types of experiments, assessing them at both the code and chart levels. The results suggest that large-scale models can generate code to produce images that partially match the reference images. However, their ability to generate accurate edits according to the instructions remains limited. The state-of-the-art (SOTA) model achieves a score of only 59.96, highlighting significant challenges in precise modification. In contrast, small-scale models, including chart-domain models, struggle both with following editing instructions and generating overall chart images, underscoring the need for further development in this area. Code is available at https://github.com/xxlllz/ChartEdit.
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' capability in chart editing tasks
Assessing integration of chart understanding and reasoning
Developing a benchmark for precise chart modifications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes ChartEdit benchmark for chart editing tasks
Evaluates 10 MLLMs on code and chart levels
Highlights challenges in precise chart modification
Xuanle Zhao
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Xuexin Liu
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Haoyue Yang
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Xianzhen Luo
Harbin Institute of Technology
Fanhu Zeng
Institute of Automation, Chinese Academy of Sciences
Jianling Li
Tianjin University, Tianjin, China
Qi Shi
Tsinghua University, Beijing, China
Chi Chen
Tsinghua University, Beijing, China