ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing chart visual question answering (VQA) methods suffer substantial performance degradation on unlabeled charts, primarily due to overreliance on textual cues while neglecting precise visual parsing. To address this, we propose the first tool-augmented multimodal agent framework that emulates human cognition by performing fine-grained, coordinate-space operations—such as axis localization, region cropping, and structural annotation—directly on the image, enabling interactive visual reasoning. Our framework tightly couples a multimodal large language model (MLLM) with a dedicated visual tool library, supporting executable-action-driven iterative inference. Evaluated on ChartBench and ChartX benchmarks, it achieves state-of-the-art performance, improving overall accuracy by up to 16.07% and attaining a 17.31% gain on unlabeled, numerically dense questions. Moreover, it exhibits strong generalization across diverse mainstream LLM backbones.

📝 Abstract
Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated charts, i.e., those requiring precise visual interpretation rather than reliance on textual shortcuts. To address this, we introduce ChartAgent, a novel agentic framework that explicitly performs visual reasoning directly within the chart's spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent iteratively decomposes queries into visual subtasks and actively manipulates and interacts with chart images through specialized actions such as drawing annotations, cropping regions (e.g., segmenting pie slices, isolating bars), and localizing axes, using a library of chart-specific vision tools to fulfill each subtask. This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension. ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, surpassing prior methods by up to a 16.07% absolute gain overall and 17.31% on unannotated, numerically intensive queries. Furthermore, our analyses show that ChartAgent (a) is effective across diverse chart types, (b) achieves the highest scores across varying visual and reasoning complexity levels, and (c) serves as a plug-and-play framework that boosts performance across diverse underlying LLMs. Our work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.
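The iterative loop described above (decompose the query, execute a chart-specific visual action, observe, repeat until the answer is grounded) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tool names (`localize_axes`, `crop_region`, `draw_annotation`) and the `mllm_plan` planner are hypothetical stand-ins for ChartAgent's actual vision-tool library and MLLM controller.

```python
# Hypothetical sketch of a ChartAgent-style act/observe loop.
from dataclasses import dataclass, field

@dataclass
class ChartState:
    query: str
    observations: list = field(default_factory=list)

# Stand-ins for chart-specific vision tools operating in coordinate space.
def localize_axes(state):
    state.observations.append("axes: x=[0, 10], y=[0, 100]")

def crop_region(state):
    state.observations.append("cropped: bar #3 isolated")

def draw_annotation(state):
    state.observations.append("annotated: gridline at bar #3 height")

TOOLS = {
    "localize_axes": localize_axes,
    "crop_region": crop_region,
    "draw_annotation": draw_annotation,
}

def mllm_plan(state):
    """Stand-in for the MLLM planner: propose the next executable
    action, or emit a final answer once enough visual evidence exists."""
    plan = ["localize_axes", "crop_region", "draw_annotation"]
    if len(state.observations) < len(plan):
        return ("act", plan[len(state.observations)])
    return ("answer", "42")

def chart_agent(query, max_steps=8):
    state = ChartState(query)
    for _ in range(max_steps):
        kind, payload = mllm_plan(state)
        if kind == "answer":
            return payload, state.observations
        TOOLS[payload](state)  # execute the visual action on the chart
    return None, state.observations
```

In the real system, each tool would modify or crop the chart image itself and the planner would be an MLLM conditioned on the updated image; here both are stubbed so the control flow is visible.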
Problem

Research questions and friction points this paper is trying to address.

Addresses performance decline on unannotated charts
Performs visual reasoning in chart's spatial domain
Decomposes queries into visual subtasks iteratively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic framework for visual reasoning in charts
Iterative decomposition into visual subtasks with annotations
Chart-specific vision tools for spatial manipulation