🤖 AI Summary
This work addresses the lack of comprehensive evaluation of cross-modal reasoning in Large Multimodal Models (LMMs) for scientific chart understanding and code generation. To this end, we introduce ChartMimic, a cross-modal benchmark for chart-to-code generation built on 4,800 human-curated, high-fidelity (figure, instruction, code) triplets derived from real academic papers. It spans 18 regular and 4 advanced chart types, diversifying into 201 fine-grained subcategories. We propose a multi-level automated evaluation framework integrating syntactic correctness, functional equivalence, and visual fidelity, enabling unified assessment of both open- and closed-source LMMs. Experimental results reveal substantial limitations: GPT-4o and InternVL2-Llama3-76B achieve average scores of only 82.2 and 61.6, respectively, across the Direct Mimic and Customized Mimic tasks, highlighting critical bottlenecks in complex cross-modal reasoning.
📝 Abstract
We introduce a new benchmark, ChartMimic, aimed at assessing the visually-grounded code generation capabilities of large multimodal models (LMMs). ChartMimic uses information-intensive visual charts and textual instructions as inputs, requiring LMMs to generate the corresponding code for chart rendering. ChartMimic includes 4,800 human-curated (figure, instruction, code) triplets, which represent authentic chart use cases found in scientific papers across various domains (e.g., Physics, Computer Science, and Economics). These charts span 18 regular types and 4 advanced types, diversifying into 201 subcategories. Furthermore, we propose multi-level evaluation metrics to provide an automatic and thorough assessment of the output code and the rendered charts. Unlike existing code generation benchmarks, ChartMimic emphasizes evaluating LMMs' capacity to harmonize a blend of cognitive capabilities, encompassing visual understanding, code generation, and cross-modal reasoning. The evaluation of 3 proprietary models and 14 open-weight models highlights the substantial challenges posed by ChartMimic. Even the advanced GPT-4o and InternVL2-Llama3-76B achieve average scores of only 82.2 and 61.6, respectively, across the Direct Mimic and Customized Mimic tasks, indicating significant room for improvement. We anticipate that ChartMimic will inspire the development of LMMs, advancing the pursuit of artificial general intelligence.
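To give a rough sense of what a multi-level automatic check on generated chart code might look like, here is a minimal sketch. The function name, the returned levels, and the example snippets are hypothetical illustrations, not ChartMimic's actual implementation; in particular, the benchmark's visual-fidelity level, which compares rendered charts, is omitted here.

```python
import ast

def evaluate_generated_code(code: str) -> dict:
    """Hypothetical two-level check on model-generated plotting code:
    Level 1 (syntactic correctness): does the code parse?
    Level 2 (execution correctness): does it run without raising?
    A real pipeline would add a visual-fidelity comparison of rendered charts."""
    result = {"syntax_ok": False, "exec_ok": False}

    # Level 1: try to parse the code into an AST.
    try:
        ast.parse(code)
        result["syntax_ok"] = True
    except SyntaxError:
        return result  # no point executing unparsable code

    # Level 2: execute in a fresh namespace and catch runtime errors.
    try:
        exec(compile(code, "<generated>", "exec"), {"__name__": "__eval__"})
        result["exec_ok"] = True
    except Exception:
        pass
    return result

# Illustrative generations: one well-formed, one truncated mid-expression.
good = "x = [1, 2, 3]\ny = [v * v for v in x]"
bad = "x = [1, 2,"
print(evaluate_generated_code(good))  # {'syntax_ok': True, 'exec_ok': True}
print(evaluate_generated_code(bad))   # {'syntax_ok': False, 'exec_ok': False}
```

In practice the execution step would run the code in a sandbox with the plotting library available, and the rendered figure would then be scored against the reference chart.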