Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding

📅 2025-04-14

🤖 AI Summary

Current multimodal large language models (MLLMs) rely heavily on textual cues and statistical shortcuts for chart understanding, and their visual reasoning is weak.

Method: We propose a more rigorous evaluation setting that removes chart text labels and introduces structural and stylistic perturbations. This setting exposes severe visual deficiencies in state-of-the-art models (e.g., GPT-4o, Gemini 2.0 Pro). To enable vision-grounded symbolic reasoning, we introduce a multi-agent collaborative framework. Generator agents reconstruct the chart as SVG by extracting geometric primitives (e.g., bar heights, line coordinates), and a critic agent dynamically verifies the fidelity of the resulting representation. Together they form a joint visual-symbolic modeling mechanism.

Contribution/Results: On the enhanced ChartQA benchmark, our method improves primitive extraction accuracy by 32.7% and reduces performance degradation under perturbations by 68.4% compared to the prior state of the art. It offers a path toward interpretable and robust visual reasoning over charts.
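The division of labor between generator and critic can be illustrated with a toy bar chart. The Python sketch below assumes the extracted primitives are bar heights already in pixel units. This 1:1 mapping from data values is a simplification that a real chart would not have. The helpers bars_to_svg, svg_bar_heights and critic_check are hypothetical. The critic's fixed-tolerance comparison against reference measurements stands in for whatever fidelity check the paper's critic agent actually performs.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # serialize SVG as the default namespace


def _q(tag):
    """Return the namespace-qualified ElementTree tag for an SVG element."""
    return f"{{{SVG_NS}}}{tag}"


def bars_to_svg(heights, bar_width=20, gap=10, chart_height=200):
    """Generator side (hypothetical): render bar-height primitives as SVG.

    Assumes data values map 1:1 to pixels. SVG's y axis points down, so
    each bar's top edge sits at chart_height - height.
    """
    width = gap + len(heights) * (bar_width + gap)
    svg = ET.Element(_q("svg"), {"width": str(width), "height": str(chart_height)})
    for i, h in enumerate(heights):
        ET.SubElement(svg, _q("rect"), {
            "x": str(gap + i * (bar_width + gap)),
            "y": str(chart_height - h),
            "width": str(bar_width),
            "height": str(h),
        })
    return ET.tostring(svg, encoding="unicode")


def svg_bar_heights(svg_text):
    """Read bar heights back out of an SVG string, ordered left to right."""
    root = ET.fromstring(svg_text)
    rects = sorted(root.iter(_q("rect")), key=lambda r: float(r.get("x")))
    return [float(r.get("height")) for r in rects]


def critic_check(svg_text, reference_heights, tol=2.0):
    """Critic side (hypothetical): return indices of bars whose rendered
    height differs from a reference measurement by more than tol.
    Missing or extra bars are flagged as well."""
    heights = svg_bar_heights(svg_text)
    bad = [i for i, (h, ref) in enumerate(zip(heights, reference_heights))
           if abs(h - ref) > tol]
    shorter = min(len(heights), len(reference_heights))
    longer = max(len(heights), len(reference_heights))
    return bad + list(range(shorter, longer))
```

For example, critic_check(bars_to_svg([50, 120, 80]), [52, 120, 60], tol=5) flags only bar 2. Its rendered height differs from the reference by 20 pixels, while bar 0 is within tolerance. Because the representation is symbolic, the critic can inspect exact numeric attributes instead of re-reading pixels.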

📝 Abstract

Multimodal Large Language Models (MLLMs) have shown remarkable versatility but struggle to demonstrate true visual understanding, particularly in chart reasoning tasks. Existing benchmarks such as ChartQA reveal significant reliance on text-based shortcuts and probabilistic pattern-matching rather than genuine visual reasoning. To evaluate visual reasoning rigorously, we introduce a more challenging test scenario by removing textual labels from the charts in the ChartQA dataset and applying chart perturbations. Under these conditions, models like GPT-4o and Gemini 2.0 Pro suffer up to a 30% performance drop, underscoring their limitations. To address these challenges, we propose Socratic Chart, a new framework that transforms chart images into Scalable Vector Graphics (SVG) representations, enabling MLLMs to integrate textual and visual modalities for enhanced chart understanding. Socratic Chart employs a multi-agent pipeline in which specialized agent-generators extract primitive chart attributes (e.g., bar heights, line coordinates) and an agent-critic validates the results, ensuring high-fidelity symbolic representations. Our framework surpasses state-of-the-art models in accurately capturing chart primitives and in improving reasoning performance, establishing a robust pathway for advancing MLLM visual understanding.

Problem

Research questions and friction points this paper is trying to address.

Evaluating true visual understanding in chart reasoning tasks

Addressing reliance on text shortcuts in chart comprehension

Improving multimodal models' accuracy in interpreting charts through SVG representations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Converts chart images to SVG representations

Uses a multi-agent pipeline to extract primitive chart attributes

Employs an agent-critic to validate the extracted attributes
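The generator/critic interaction can be sketched as a bounded retry loop. This minimal Python sketch works under stated assumptions. In the paper the agents are MLLMs, whereas here generate and critique are plain callables. Verdict, LoopResult, generate_and_verify and the max_rounds limit are hypothetical names and are not part of the described framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Verdict:
    ok: bool            # did the critic accept the proposal?
    feedback: str = ""  # critic's explanation when it rejects


@dataclass
class LoopResult:
    output: object
    accepted: bool
    rounds: int
    history: list = field(default_factory=list)  # (proposal, verdict) pairs


def generate_and_verify(generate: Callable[[Optional[str]], object],
                        critique: Callable[[object], Verdict],
                        max_rounds: int = 3) -> LoopResult:
    """Hypothetical generator-critic loop: the generator proposes a chart
    representation, the critic accepts it or returns feedback, and the
    generator retries with that feedback until accepted or out of rounds."""
    feedback: Optional[str] = None
    history = []
    output = None
    for round_no in range(1, max_rounds + 1):
        output = generate(feedback)
        verdict = critique(output)
        history.append((output, verdict))
        if verdict.ok:
            return LoopResult(output, True, round_no, history)
        feedback = verdict.feedback
    return LoopResult(output, False, max_rounds, history)


# Toy usage: a scripted "generator" whose first answer misreads bar 1.
# It ignores feedback; a real generator would use it to revise its answer.
# The toy critic compares against a known target purely for demonstration,
# whereas a real critic would check the proposal against the chart image.
TARGET = [50, 120, 80]
_attempts = iter([[50, 110, 80], [50, 120, 80]])


def toy_generate(feedback):
    return next(_attempts)


def toy_critique(heights):
    bad = [i for i, (h, t) in enumerate(zip(heights, TARGET)) if abs(h - t) > 2]
    return Verdict(ok=not bad, feedback=f"recheck bars {bad}" if bad else "")


result = generate_and_verify(toy_generate, toy_critique, max_rounds=3)
```

In the toy run the critic rejects the first proposal with "recheck bars [1]" and accepts the second, so result.rounds is 2. Capping the number of rounds keeps a persistently wrong generator from looping forever. In that case the caller gets the last proposal marked as not accepted.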

🔎 Similar Papers

VProChart: Answering Chart Question through Visual Perception Alignment Agent and Programmatic Solution Reasoning