ChartSketcher: Reasoning with Multimodal Feedback and Reflection for Chart Understanding

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing chart understanding methods rely heavily on text-based logical reasoning (e.g., Chain-of-Thought), which leaves them unable to correct errors that stem from flawed visual perception. To address this, we propose ChartSketcher, an iterative vision-language reasoning framework built on Sketch-CoT: the model generates executable sketch annotations directly on the chart through a programmatic sketching library and feeds the annotated chart back into multi-step reasoning, closing a visual feedback loop. Training follows a two-stage paradigm: a cold-start supervised phase to learn sketch-based reasoning patterns, followed by off-policy reinforcement learning to strengthen reflection and generalization. ChartSketcher achieves promising results on multiple chart understanding benchmarks and general vision tasks, offering interpretability, natural human-machine interaction, and cross-task transferability, moving beyond purely language-based reasoning toward visually grounded, executable reasoning.

📝 Abstract
Charts are high-density visualization carriers for complex data, serving as a crucial medium for information extraction and analysis. Automated chart understanding poses significant challenges to existing multimodal large language models (MLLMs) due to the need for precise and complex visual reasoning. Current step-by-step reasoning models primarily focus on text-based logical reasoning for chart understanding. However, they struggle to refine or correct their reasoning when errors stem from flawed visual understanding, as they lack the ability to leverage multimodal interaction for deeper comprehension. Inspired by human cognitive behavior, we propose ChartSketcher, a multimodal feedback-driven step-by-step reasoning method designed to address these limitations. ChartSketcher is a chart understanding model that employs Sketch-CoT, enabling MLLMs to annotate intermediate reasoning steps directly onto charts using a programmatic sketching library, iteratively feeding these visual annotations back into the reasoning process. This mechanism enables the model to visually ground its reasoning and refine its understanding over multiple steps. We employ a two-stage training strategy: a cold start phase to learn sketch-based reasoning patterns, followed by off-policy reinforcement learning to enhance reflection and generalization. Experiments demonstrate that ChartSketcher achieves promising performance on chart understanding benchmarks and general vision tasks, providing an interactive and interpretable approach to chart comprehension.
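To make the Sketch-CoT loop described above concrete, the following is a minimal, hypothetical sketch of how such a feedback cycle could look: the model emits a drawing command for the chart, the command is rendered onto the image with a standard drawing library, and the annotated chart is fed back for the next reasoning step. The `query_mllm` call and the JSON command format are illustrative assumptions, not the paper's actual sketching library or API.

```python
# Minimal illustrative sketch of a Sketch-CoT-style feedback loop.
# Assumptions: `query_mllm` is a placeholder for any multimodal LLM call;
# the JSON command format is hypothetical, not the paper's actual library.
import json
from PIL import Image, ImageDraw

def query_mllm(image: Image.Image, prompt: str) -> str:
    """Placeholder for a multimodal LLM call that returns either a
    drawing command (JSON) or a final answer prefixed with 'ANSWER:'."""
    raise NotImplementedError

def apply_sketch(image: Image.Image, cmd: dict) -> Image.Image:
    """Render one sketch command (here: circle or line) onto a copy of the chart."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    if cmd["op"] == "circle":
        x, y, r = cmd["x"], cmd["y"], cmd["r"]
        draw.ellipse([x - r, y - r, x + r, y + r], outline="red", width=3)
    elif cmd["op"] == "line":
        draw.line(cmd["points"], fill="red", width=3)
    return annotated

def sketch_cot(chart: Image.Image, question: str, max_steps: int = 8) -> str:
    """Iterate: reason -> sketch -> feed the annotated chart back, until an answer."""
    image = chart
    for _ in range(max_steps):
        reply = query_mllm(image, f"Question: {question}\nAnnotate the chart or answer.")
        if reply.startswith("ANSWER:"):
            return reply.removeprefix("ANSWER:").strip()
        image = apply_sketch(image, json.loads(reply))  # visual feedback for the next step
    return "no answer within step budget"
```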
Problem

Research questions and friction points this paper is trying to address.

Automated chart understanding demands precise, complex visual reasoning that existing MLLMs struggle to perform
Current step-by-step reasoning models reason only in text and therefore cannot refine or correct errors that originate in flawed visual perception
How to feed sketch-based visual annotations back into the reasoning loop so the model can ground and revise its understanding of a chart
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sketch-CoT: a multimodal feedback-driven, step-by-step reasoning method
A programmatic sketching library that lets the model annotate intermediate reasoning steps directly on the chart
Two-stage training: cold-start supervised learning followed by off-policy reinforcement learning to enhance reflection and generalization
Muye Huang
School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, 710049, China; MOE KLINNS Lab, Xi’an Jiaotong University, Xi’an, 710049, China; Zhongguancun Academy, Beijing, 100094, China
Lingling Zhang
Assistant Professor, Xi'an Jiaotong University
Computer vision; Few-shot learning; Zero-shot learning
Jie Ma
MOE KLINNS Lab, Xi’an Jiaotong University, Xi’an, 710049, China
Han Lai
School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, 710049, China; MOE KLINNS Lab, Xi’an Jiaotong University, Xi’an, 710049, China
Fangzhi Xu
Xi'an Jiaotong University | Nanyang Technological University
Large Language Models; Self-Training; Reasoning; GUI Agents
Yifei Li
School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, 710049, China; MOE KLINNS Lab, Xi’an Jiaotong University, Xi’an, 710049, China; Zhongguancun Academy, Beijing, 100094, China
Wenjun Wu
School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, 710049, China; MOE KLINNS Lab, Xi’an Jiaotong University, Xi’an, 710049, China
Yaqiang Wu
Lenovo
Jun Liu
School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, 710049, China; MOE KLINNS Lab, Xi’an Jiaotong University, Xi’an, 710049, China