🤖 AI Summary
Existing large multimodal models (LMMs) lack realistic, hierarchical, and user-centric evaluation frameworks for chart understanding and code generation (chart2code). To address this, we introduce Chart2Code, the first difficulty-graded, application-oriented multimodal benchmark for chart2code. It comprises three progressively challenging tiers: chart reconstruction, interactive editing, and long-table-to-chart generation. Evaluation integrates two dimensions, code correctness and visual fidelity, validated via human annotation combined with automated verification across 22 chart types and 2,023 high-quality samples. We comprehensively evaluate 25 state-of-the-art (SOTA) multimodal models. The results reveal severe limitations: even the strongest model, GPT-5, achieves only 0.57 code accuracy and a 0.22 visual quality score on editing tasks. These findings underscore the task's inherent difficulty and expose critical model deficiencies. Chart2Code establishes a scalable, diagnostic evaluation paradigm to guide future research in multimodal chart understanding and generation.
📝 Abstract
We introduce Chart2Code, a new benchmark for evaluating the chart understanding and code generation capabilities of large multimodal models (LMMs). Chart2Code is explicitly designed from a user-driven perspective, capturing diverse real-world scenarios and progressively increasing task difficulty. It consists of three levels: Level 1 (Chart Reproduction) asks models to reproduce a chart from a reference figure and a user query; Level 2 (Chart Editing) involves complex modifications such as changing chart types or adding elements; and Level 3 (Long-Table to Chart Generation) requires models to transform long, information-dense tables into faithful charts following user instructions. To our knowledge, this is the first hierarchical benchmark that reflects practical chart2code usage while systematically scaling task complexity. In total, Chart2Code contains 2,023 tasks across 22 chart types, paired with multi-level evaluation metrics that assess both code correctness and the visual fidelity of rendered charts. We benchmark 25 state-of-the-art (SoTA) LMMs, including proprietary models and the latest open-source models, such as GPT-5, Qwen2.5-VL, InternVL3/3.5, MiMo-VL, and Seed-1.6-VL. Experimental results demonstrate that even the SoTA model GPT-5 averages only 0.57 on code-based evaluation and 0.22 on chart-quality assessment across the editing tasks, underscoring the difficulty of Chart2Code. We anticipate this benchmark will drive advances in multimodal reasoning and foster the development of more robust and general-purpose LMMs. Our code and data are available on Chart2Code.
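To make the code-correctness dimension concrete, a metric of this kind can be sketched as attribute matching between a reference chart specification and one recovered from model-generated code. The field names, tolerance, and equal weighting below are illustrative assumptions, not Chart2Code's actual metric:

```python
# Illustrative sketch of a code-correctness score: compare chart attributes
# recovered from model-generated code against a reference specification.
# Field names and equal weighting are assumptions, not Chart2Code's metric.

def code_score(reference: dict, generated: dict) -> float:
    """Fraction of reference chart attributes the generated chart matches."""
    checks = []
    # Chart type must match exactly (e.g., "bar" vs. "line").
    checks.append(generated.get("chart_type") == reference["chart_type"])
    # Axis labels and title: exact string match.
    for key in ("x_label", "y_label", "title"):
        checks.append(generated.get(key) == reference[key])
    # Data series: same length and values within a small tolerance.
    ref_y = reference["y_data"]
    gen_y = generated.get("y_data", [])
    same_len = len(ref_y) == len(gen_y)
    checks.append(same_len and all(abs(a - b) < 1e-6
                                   for a, b in zip(ref_y, gen_y)))
    return sum(checks) / len(checks)

reference = {"chart_type": "bar", "x_label": "Year", "y_label": "Revenue",
             "title": "Annual Revenue", "y_data": [4.1, 3.2, 5.6]}
generated = {"chart_type": "bar", "x_label": "Year", "y_label": "Revenue",
             "title": "Revenue by Year", "y_data": [4.1, 3.2, 5.6]}

print(code_score(reference, generated))  # 4 of 5 checks pass -> 0.8
```

A full pipeline would pair such a check with visual-fidelity scoring of the rendered figure (e.g., human annotation or image-level comparison), matching the dual evaluation described above.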