ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning

📅 2025-11-28
🤖 AI Summary
Current multimodal large language models (MLLMs) over-rely on OCR for chart understanding, which leads to numerical hallucinations when annotations are sparse, and they exhibit weak visual grounding, particularly in precise spatial localization. To address this, we propose PointCoT, a reasoning framework that dynamically aligns textual reasoning chains with visual regions by generating element-level bounding boxes and re-rendering charts, thereby enhancing structural and proportional grounding. We further introduce ChartPoint-SFT-62k, a large-scale, high-quality dataset built by an automated synthesis pipeline that combines instructions, chain-of-thought (CoT) reasoning, and visual grounding annotations. Leveraging this data, we train the ChartPointQ2 and ChartPointQ2.5 models, which outperform state-of-the-art methods across benchmarks, e.g., by +5.04% on ChartBench, demonstrating superior reasoning accuracy and interpretability.

📝 Abstract
Multimodal Large Language Models (MLLMs) have emerged as powerful tools for chart comprehension. However, they rely heavily on content extracted via OCR, which leads to numerical hallucinations when chart textual annotations are sparse. While existing methods focus on scaling instructions, they fail to address the fundamental challenge, i.e., reasoning with visual perception. In this paper, we identify a critical observation: MLLMs exhibit weak grounding in chart elements and proportional relationships, as evidenced by their inability to localize key positions that match their reasoning. To bridge this gap, we propose PointCoT, which integrates reflective interaction into chain-of-thought reasoning over charts. By prompting MLLMs to generate bounding boxes and re-render charts based on location annotations, we establish connections between textual reasoning steps and visual grounding regions. We further introduce an automated pipeline to construct ChartPoint-SFT-62k, a dataset featuring 19.2K high-quality chart samples with step-by-step CoT, bounding boxes, and re-rendered visualizations. Leveraging this data, we develop two instruction-tuned models, ChartPointQ2 and ChartPointQ2.5, which outperform the state of the art across several chart benchmarks, e.g., by +5.04% on ChartBench.
Problem

Research questions and friction points this paper is trying to address.

Addresses MLLMs' numerical hallucinations from sparse chart text
Improves grounding in chart elements and proportional relationships
Connects textual reasoning steps with visual grounding regions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates reflective interaction into chain-of-thought reasoning
Prompts MLLMs to generate bounding boxes and re-render charts
Uses automated pipeline to construct dataset with CoT annotations
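The reflective loop described above can be sketched in a few lines. Everything here (the `point_cot` driver, the `Step` record, and the `render_with_box` re-renderer) is a hypothetical illustration of the paper's idea, not its released code: each round asks the model for a reasoning step plus an element-level bounding box, then re-renders the chart with that box so the next round can reflect on whether the localization matches the reasoning.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixel coordinates


@dataclass
class Step:
    thought: str  # one textual reasoning step
    bbox: BBox    # chart region the step refers to


@dataclass
class Chart:
    image_path: str
    overlays: List[BBox]  # boxes already drawn onto the re-rendered chart


def render_with_box(chart: Chart, bbox: BBox) -> Chart:
    """Re-render the chart with one more bounding-box annotation.

    Here the box is only recorded; a real pipeline would rasterize it
    onto the chart image before the next model call.
    """
    return Chart(chart.image_path, chart.overlays + [bbox])


def point_cot(model, chart: Chart, question: str, max_steps: int = 4) -> List[Step]:
    """Hypothetical PointCoT loop: alternate textual reasoning with grounding.

    The model (assumed to expose a `reason(chart, question, steps)` method)
    emits a thought plus a bounding box, or None when the chain is complete.
    """
    steps: List[Step] = []
    for _ in range(max_steps):
        step: Optional[Step] = model.reason(chart, question, steps)
        if step is None:  # model signals the chain is complete
            break
        chart = render_with_box(chart, step.bbox)  # reflect on the annotated chart
        steps.append(step)
    return steps
```

In the actual system, `model.reason` would be an MLLM call and the re-rendered chart (with accumulated boxes) is fed back as visual context, which is what ties each textual step to a grounded region.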
Zhengzhuo Xu
Tsinghua University
SiNan Du
Tsinghua University
Yiyan Qi
IDEA
Siwen Lu
Beihang University
Chengjin Xu
International Digital Economy Academy & DataArc Tech Inc.
Chun Yuan
Tsinghua University
Jian Guo
International Digital Economy Academy, Hong Kong University of Science and Technology (Guangzhou)