Effective Training Data Synthesis for Improving MLLM Chart Understanding

📅 2025-08-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing open-source multimodal large language models (MLLMs) perform poorly on scientific chart understanding, with typical success rates of only 30–50% on challenging benchmarks, largely because synthetic training charts are visually simple and distributionally mismatched with real-world charts. To address this, we propose a five-step modular chart synthesis pipeline that separates data creation from plotting-function creation, conditions later subplots on earlier ones in multi-subplot figures, diversifies visual details, filters out low-quality samples, and uses GPT-4o to generate QA pairs. The result is the ECD dataset (10k+ chart images, 300k+ QA pairs), which spans 25 topics and 250+ chart-type combinations. Fine-tuning MLLMs on ECD consistently improves performance on both real-world and synthetic test sets.

📝 Abstract

Being able to effectively read scientific plots, or chart understanding, is central to building effective agents for science. However, existing multimodal large language models (MLLMs), especially open-source ones, still fall behind, with a typical success rate of 30%-50% on challenging benchmarks. Previous studies on fine-tuning MLLMs with synthetic charts are often restricted by the charts' inadequate similarity to real charts, which could compromise model training and performance on complex real-world charts. In this study, we show that modularizing chart generation and diversifying visual details improve chart understanding capabilities. In particular, we design a five-step data synthesis pipeline, where we separate data and function creation for single plot generation, condition the generation of later subplots on earlier ones for multi-subplot figures, visually diversify the generated figures, filter out low-quality data, and finally generate the question-answer (QA) pairs with GPT-4o. This approach allows us to streamline the generation of fine-tuning datasets and introduce the Effective Chart Dataset (ECD), which contains 10k+ chart images and 300k+ QA pairs, covering 25 topics and featuring 250+ chart type combinations with high visual complexity. We show that ECD consistently improves the performance of various MLLMs on a range of real-world and synthetic test sets. Code, data, and models are available at: https://github.com/yuweiyang-anu/ECD.
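
The five steps named in the abstract can be pictured with a small, standard-library-only sketch. This is not the authors' implementation (see the linked repository for that): every name here (`make_data`, `make_subplot`, `diversify`, `passes_filter`, `generate_qa`, `synthesize`), the chart-type list, the style options, the range-based filter, and the templated question are hypothetical stand-ins chosen only to show how the stages might fit together. In particular, the QA step below uses a fixed template in place of a GPT-4o call, and nothing is actually rendered.

```python
import random
from dataclasses import dataclass, field

# Hypothetical subset of chart types; ECD itself reports 250+ type combinations.
CHART_TYPES = ["line", "bar", "scatter", "box"]


@dataclass
class Subplot:
    chart_type: str
    data: list
    style: dict = field(default_factory=dict)


@dataclass
class Figure:
    topic: str
    subplots: list
    qa_pairs: list = field(default_factory=list)


def make_data(rng, n=8):
    # Step 1a: data is created on its own, with no plotting logic involved.
    return [round(rng.uniform(0, 100), 1) for _ in range(n)]


def make_subplot(rng, earlier):
    # Step 1b: pick the plotting function separately from the data.
    # Step 2: condition on earlier subplots (assumed rule: prefer unused types).
    used = {s.chart_type for s in earlier}
    unused = [c for c in CHART_TYPES if c not in used]
    return Subplot(rng.choice(unused or CHART_TYPES), make_data(rng))


def diversify(rng, fig):
    # Step 3: vary visual details per subplot (placeholder style options).
    for s in fig.subplots:
        s.style = {
            "color": rng.choice(["tab:blue", "tab:orange", "tab:green"]),
            "grid": rng.choice([True, False]),
            "annotate_max": rng.choice([True, False]),
        }


def passes_filter(fig, min_range=10.0):
    # Step 4: stand-in quality check that rejects near-flat subplots.
    return all(max(s.data) - min(s.data) >= min_range for s in fig.subplots)


def generate_qa(fig):
    # Step 5: template stand-in for the GPT-4o call described in the abstract.
    return [
        (f"What is the maximum value in subplot {i + 1} ({s.chart_type})?", max(s.data))
        for i, s in enumerate(fig.subplots)
    ]


def synthesize(n_figures, seed=0):
    # Run all five steps and keep only figures that pass the filter.
    rng = random.Random(seed)
    kept = []
    for _ in range(n_figures):
        fig = Figure(topic=rng.choice(["physics", "economics"]), subplots=[])
        for _ in range(rng.randint(1, 3)):
            fig.subplots.append(make_subplot(rng, fig.subplots))
        diversify(rng, fig)
        if passes_filter(fig):
            fig.qa_pairs = generate_qa(fig)
            kept.append(fig)
    return kept
```

Calling `synthesize(20, seed=1)` returns a list of `Figure` records, each carrying its subplots, styles, and one templated QA pair per subplot. Keeping each stage as its own function mirrors the modular design the abstract describes: any single stage (for example the filter or the QA generator) could be swapped out without touching the others.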

Problem

Research questions and friction points this paper is trying to address.

Improving MLLM chart understanding with synthetic data

Overcoming the limited similarity of synthetic charts to real charts

Enhancing model performance on complex real-world charts

Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular chart generation for realistic synthesis

Diversified visual details enhance training data

Five-step pipeline with quality filtering and GPT-4o-generated QA pairs

🔎 Similar Papers

On Pre-training of Multimodal Language Models Customized for Chart Understanding