ChartM³: Benchmarking Chart Editing with Multimodal Instructions

📅 2025-07-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing chart editing methods rely solely on natural language instructions, which are often too ambiguous to support fine-grained, precise edits. Method: We propose a multimodal chart editing paradigm that jointly leverages natural language and visual directives (e.g., bounding boxes, arrows) to specify editing intentions unambiguously. We introduce ChartM³, the first benchmark dataset for multimodal chart editing, comprising 1,000 expert-annotated samples spanning four levels of editing complexity, and design dual evaluation metrics that assess both visual fidelity and code correctness. Further, we pioneer the integration of visual directives into MLLM-based chart editing, fine-tuning multimodal large language models (MLLMs) on ChartM³-Train (24,000 synthetic samples). Results: Experiments demonstrate substantial improvements in fine-grained editing performance. Our analysis reveals a critical bottleneck in current MLLMs' ability to interpret visual cues, establishing ChartM³ as a foundational benchmark and a new technical pathway for vision-guided multimodal editing tasks.

📝 Abstract
Charts are a fundamental visualization format widely used in data analysis across research and industry. While enabling users to edit charts based on high-level intentions is of great practical value, existing methods primarily rely on natural language instructions, which are often too ambiguous to support fine-grained editing. In this work, we introduce a novel paradigm for multimodal chart editing, where user intent is expressed through a combination of natural language and visual indicators that explicitly highlight the elements to be modified. To support this paradigm, we present ChartM³, a new benchmark for Multimodal chart editing with Multi-level complexity and Multi-perspective evaluation. ChartM³ contains 1,000 samples spanning four levels of editing difficulty. Each sample is a triplet in the form of (chart, code, multimodal instructions). To comprehensively evaluate chart editing models, ChartM³ provides metrics that assess both visual appearance and code correctness. Our benchmark reveals significant limitations in current multimodal large language models (MLLMs), including GPT-4o, particularly in their ability to interpret and act on visual indicators. To address this, we construct ChartM³-Train, a large-scale training set with 24,000 multimodal chart editing samples. Fine-tuning MLLMs on this dataset leads to substantial improvements, demonstrating the importance of multimodal supervision in building practical chart editing systems. Our datasets, codes, and evaluation tools are available at https://github.com/MLrollIT/ChartM3.
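To make the data format concrete, here is a minimal Python sketch of what one (chart, code, multimodal instructions) triplet could look like. The class and field names (VisualDirective, ChartEditSample, coords, difficulty) are illustrative assumptions, not the schema of the released dataset.

```python
from dataclasses import dataclass, field


@dataclass
class VisualDirective:
    """A visual cue drawn on the chart, e.g. a bounding box or an arrow (assumed structure)."""
    kind: str        # "bbox" or "arrow"
    coords: tuple    # bbox: (x0, y0, x1, y1); arrow: (tail_x, tail_y, head_x, head_y)


@dataclass
class ChartEditSample:
    """One (chart, code, multimodal instructions) triplet (hypothetical schema)."""
    chart_image: str                  # path to the rendered source chart
    source_code: str                  # plotting code that produced the chart
    instruction_text: str             # natural-language edit request
    directives: list = field(default_factory=list)  # VisualDirective cues on the image
    target_code: str = ""             # ground-truth edited code
    difficulty: int = 1               # editing complexity level, 1-4


sample = ChartEditSample(
    chart_image="charts/bar_0001.png",
    source_code="# matplotlib script that drew the chart ...",
    instruction_text="Change the highlighted bar to red.",
    directives=[VisualDirective(kind="bbox", coords=(120, 40, 180, 260))],
    difficulty=2,
)
```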
Problem

Research questions and friction points this paper is trying to address.

Enabling precise chart editing via multimodal instructions
Addressing ambiguity in natural language chart editing
Benchmarking MLLMs for visual and code accuracy (see the evaluation sketch after this list)
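The benchmark scores edits from two perspectives: visual appearance and code correctness. The paper defines its own metrics; the sketch below is a hypothetical stand-in that pairs a crude pixel-agreement score with a textual code-similarity score, purely to illustrate the dual-metric idea. visual_score, code_score, and dual_score are all assumed names.

```python
from difflib import SequenceMatcher

import numpy as np
from PIL import Image


def visual_score(pred_png: str, ref_png: str) -> float:
    """Appearance check: mean per-pixel agreement of the two rendered charts."""
    a = np.asarray(Image.open(pred_png).convert("RGB").resize((256, 256)), dtype=float)
    b = np.asarray(Image.open(ref_png).convert("RGB").resize((256, 256)), dtype=float)
    return 1.0 - float(np.abs(a - b).mean()) / 255.0


def code_score(pred_code: str, ref_code: str) -> float:
    """Correctness proxy: normalized textual similarity of predicted vs. reference code."""
    return SequenceMatcher(None, pred_code, ref_code).ratio()


def dual_score(pred_png: str, ref_png: str,
               pred_code: str, ref_code: str, w_vis: float = 0.5) -> float:
    """Blend the two perspectives into a single number."""
    return w_vis * visual_score(pred_png, ref_png) + (1.0 - w_vis) * code_score(pred_code, ref_code)
```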
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal chart editing with visual indicators
Benchmark with multi-level complexity and evaluation
Large-scale training set for MLLM fine-tuning (see the data-format sketch after this list)
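A plausible way to turn such samples into MLLM fine-tuning data is a chat-style record that pairs the annotated chart image and the text instruction with the target code. The layout below (to_chat_example, the messages structure) is an assumption for illustration; the released ChartM³-Train format may differ.

```python
import json


def to_chat_example(sample: dict) -> dict:
    """Convert one annotated triplet into a chat-style fine-tuning record (assumed layout)."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    # chart image with the visual directives (boxes/arrows) already drawn on it
                    {"type": "image", "path": sample["chart_image"]},
                    {"type": "text", "text": sample["instruction_text"]},
                ],
            },
            # supervision target: the edited plotting code
            {"role": "assistant", "content": sample["target_code"]},
        ]
    }


record = to_chat_example({
    "chart_image": "charts/bar_0001_annotated.png",
    "instruction_text": "Change the highlighted bar to red.",
    "target_code": "# edited matplotlib script ...",
})
print(json.dumps(record, indent=2))
```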
👥 Authors
Danglu Yang, RUC
Liang Zhang, independent researcher
Zihao Yue, Renmin University of China (Multimodal AI, Language Modeling)
Liangyu Chen, RUC
Yichen Xu, RUC
Wenxuan Wang, RUC
Qin Jin, School of Information, Renmin University of China (Artificial Intelligence)