🤖 AI Summary
Chart editing poses a significant challenge for multimodal large language models (MLLMs), yet existing evaluations rely on ad hoc case studies rather than systematic benchmarks. To address this gap, we introduce ChartEdit, the first high-quality, instruction-driven benchmark designed specifically for chart editing. It comprises 1,405 human-annotated instructions over 233 real-world charts and enables fine-grained evaluation through a dual-level assessment protocol: code-level correctness and chart-level visual fidelity. Using ChartEdit, we systematically evaluate 10 state-of-the-art MLLMs; the best-performing model achieves only 59.96/100 on average, revealing critical limitations in precise intent understanding and controllable, faithful editing. This work establishes the first reproducible, extensible evaluation infrastructure for chart editing, filling a fundamental gap in MLLM assessment and providing a foundation for future research.
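To make the dual-level protocol concrete, the sketch below shows one plausible way to combine a code-level score and a chart-level score into a single 0-100 rating. This is purely illustrative: the function names, the line-similarity and pixel-match metrics, and the equal weighting are assumptions for this sketch, not ChartEdit's actual evaluation method.

```python
# Hypothetical sketch of a dual-level chart-editing evaluation.
# The metrics and weighting here are illustrative stand-ins, not
# the benchmark's actual protocol.
import difflib

def code_level_score(generated_code: str, reference_code: str) -> float:
    """Code-level correctness: line-based similarity between the
    generated and reference plotting scripts (0-100)."""
    matcher = difflib.SequenceMatcher(
        None, generated_code.splitlines(), reference_code.splitlines()
    )
    return 100.0 * matcher.ratio()

def chart_level_score(generated_pixels, reference_pixels) -> float:
    """Chart-level visual fidelity: fraction of matching pixels
    between the rendered charts (stand-in for a perceptual metric)."""
    matches = sum(g == r for g, r in zip(generated_pixels, reference_pixels))
    return 100.0 * matches / max(len(reference_pixels), 1)

def dual_level_score(gen_code, ref_code, gen_img, ref_img, w=0.5) -> float:
    """Weighted combination of the two levels; w balances code vs. chart."""
    return (w * code_level_score(gen_code, ref_code)
            + (1 - w) * chart_level_score(gen_img, ref_img))
```

In practice, a benchmark of this kind would replace the line-diff with checks on the edited chart properties and the pixel match with a stronger visual-similarity measure; the two-level structure is the point of the sketch.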
📄 Abstract
Although multimodal large language models (MLLMs) show promise in generating chart-rendering code, chart editing presents a greater challenge. The difficulty stems from its nature as a labor-intensive task for humans that also requires MLLMs to integrate chart understanding, complex reasoning, and precise intent interpretation. While many MLLMs claim such editing capabilities, current assessments typically rely on limited case studies rather than robust evaluation methodologies, highlighting the urgent need for a comprehensive evaluation framework. In this work, we propose ChartEdit, a new high-quality benchmark designed for chart editing tasks. The benchmark comprises 1,405 diverse editing instructions applied to 233 real-world charts, with each instruction-chart instance manually annotated and validated for accuracy. Using ChartEdit, we evaluate the performance of 10 mainstream MLLMs across two types of experiments, assessing them at both the code level and the chart level. The results suggest that large-scale models can generate code whose rendered images partially match the references, but their ability to produce accurate edits according to the instructions remains limited: the state-of-the-art (SOTA) model achieves a score of only 59.96, highlighting significant challenges in precise modification. In contrast, small-scale models, including chart-domain models, struggle both to follow editing instructions and to generate the overall chart image, underscoring the need for further development in this area. Code is available at https://github.com/xxlllz/ChartEdit.