🤖 AI Summary
This work addresses key limitations of multimodal large language models (MLLMs) in chart understanding: scarce high-quality training data, difficulty with fine-grained visual grounding, and imprecise numerical computation. The authors first propose DuoChart, a scalable dual-source data pipeline that combines synthesized and real-world charts into a diverse training set. They then introduce CharTool, which equips MLLMs with image cropping and code execution tools and is trained with agentic reinforcement learning on DuoChart so that tool calls are grounded in chart content. Across six chart understanding benchmarks, the method consistently improves over strong MLLM baselines. CharTool-7B outperforms its base model by 8.0% on CharXiv (Reasoning) and 9.78% on ChartQAPro, and CharTool also shows positive generalization to out-of-domain visual math reasoning benchmarks.
📝 Abstract
Charts are ubiquitous in scientific and financial literature for presenting structured data. However, chart reasoning remains challenging for multimodal large language models (MLLMs) due to the lack of high-quality training data, as well as the need for fine-grained visual grounding and precise numerical computation. To address these challenges, we first propose DuoChart, a scalable dual-source data pipeline that combines synthesized charts with real-world charts to construct diverse, high-quality chart training data. We then introduce CharTool, which equips MLLMs with external tools, including image cropping for localized visual perception and code-based computation for accurate numerical reasoning. Through agentic reinforcement learning on DuoChart, CharTool learns tool-integrated reasoning grounded in chart content. Extensive experiments on six chart benchmarks show that our method consistently improves over strong MLLM baselines across model scales. Notably, CharTool-7B outperforms the base model by **+8.0%** on CharXiv (Reasoning) and **+9.78%** on ChartQAPro, while achieving competitive performance with substantially larger or proprietary models. Moreover, CharTool demonstrates positive generalization to out-of-domain visual math reasoning benchmarks.
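To make the two tool types named in the abstract concrete, here is a minimal sketch of what a cropping tool and a computation tool could look like behind a simple dispatcher. The abstract does not specify CharTool's tool interface, so every name here (`crop_region`, `compute`, `call_tool`, the `TOOLS` registry) and the nested-list image format are hypothetical. The computation tool is also restricted to plain arithmetic for safety, which is narrower than the general code execution the abstract describes.

```python
import ast
import operator

# All names below are hypothetical illustrations, not CharTool's actual interface.


def crop_region(pixels, left, top, right, bottom):
    """Return the sub-grid of a row-major image covering rows top..bottom-1
    and columns left..right-1 (the kind of zoom a cropping tool provides)."""
    if not (0 <= left < right <= len(pixels[0]) and 0 <= top < bottom <= len(pixels)):
        raise ValueError("crop box outside image bounds")
    return [row[left:right] for row in pixels[top:bottom]]


# Arithmetic operators the computation tool is allowed to apply.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def compute(expression):
    """Evaluate a plain arithmetic expression by walking its syntax tree,
    rejecting anything that is not a number or an allowed operator."""

    def _eval(node):
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)


# Registry mapping a tool name emitted by the model to its implementation.
TOOLS = {"crop": crop_region, "compute": compute}


def call_tool(name, **kwargs):
    """Dispatch one tool call and return its result to the reasoning loop."""
    return TOOLS[name](**kwargs)
```

In a tool-integrated reasoning loop, a model might call `crop` to inspect a legend or axis region more closely, read values from it, and then call `compute` on an expression such as `"(120 - 80) / 80 * 100"` instead of doing the arithmetic in free text.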