Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

📅 2026-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing chart question-answering benchmarks struggle to rigorously evaluate the visual reasoning capabilities of vision-language models, because these models can often answer from background knowledge or dataset-specific shortcuts rather than from genuine chart comprehension. This work proposes Chartographer, a framework for counterfactual chart evaluation. It reverse-engineers original charts into executable code, validates the fidelity of the reconstruction, and generates seed-controlled chart variants that keep the original question fixed while changing the chart and therefore the correct answer. New ground-truth answers are derived by running executable QA logic on each variant. Experiments on existing chart QA datasets show that vision-language models often fail on counterfactual charts even after answering the original chart correctly, with failures most prevalent when the updated chart requires a novel visual reasoning pathway.
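The idea of a seed-controlled counterfactual variant with a re-derived answer can be illustrated with a minimal Python sketch. Everything named here (BarChartSpec, make_counterfactual, answer_highest_category, the example data) is hypothetical and is not taken from Chartographer itself; rendering the chart to an image is omitted, and value shuffling is only one assumed kind of variation.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BarChartSpec:
    """Hypothetical stand-in for a chart reconstructed as executable code."""
    title: str
    categories: tuple[str, ...]
    values: tuple[float, ...]


def answer_highest_category(spec: BarChartSpec) -> str:
    """Executable QA logic for the fixed question: 'Which category is highest?'"""
    best_index = max(range(len(spec.values)), key=lambda i: spec.values[i])
    return spec.categories[best_index]


def make_counterfactual(spec: BarChartSpec, seed: int) -> BarChartSpec:
    """Reassign the values to categories in a shuffled order, reproducibly per seed."""
    rng = random.Random(seed)  # a private generator keeps each variant reproducible
    shuffled = list(spec.values)
    rng.shuffle(shuffled)
    return BarChartSpec(spec.title, spec.categories, tuple(shuffled))


# Illustrative data only; values are distinct so the answer is unambiguous.
original = BarChartSpec("Units sold", ("A", "B", "C", "D"), (12.0, 30.0, 7.0, 21.0))
variants = [make_counterfactual(original, seed) for seed in range(5)]

# The question stays fixed; the ground truth is recomputed from each variant.
answers = [answer_highest_category(v) for v in variants]
```

Because the answer is computed from the variant's data rather than stored with the original chart, a model that has memorized the original answer gains nothing from it when shown a variant.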
📝 Abstract
Chart question-answering (QA) benchmarks aim to pose questions that require visual reasoning to answer correctly, but models can often reach solutions through shortcuts or prior familiarity with a chart based on their own background knowledge. To strictly evaluate visual reasoning, we propose counterfactual charts where the chart-question task remains fixed, but the underlying chart and the corresponding answer are varied. We introduce Chartographer, a framework to reverse engineer charts into executable code, validate reconstruction fidelity, generate seed-controlled counterfactual variants, and derive new answers from executable QA logic. We apply this framework to existing chart QA datasets and evaluate proprietary and open-source vision-language models (VLMs), measuring variation sensitivity and generalizability. Counterfactual charts reveal failures hidden by single-chart performance: VLMs often fail to generalize after answering the original chart correctly. We find failures are most prevalent when updated charts require novel visual reasoning pathways.
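One way to quantify "fails to generalize after answering the original chart correctly" is to condition on the original being solved and ask how often every variant is also solved. The sketch below is an assumption for illustration, not the paper's metric definition; ChartResult, both function names, and the sample results are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ChartResult:
    """Correctness of one model on one chart and on its counterfactual variants."""
    original_correct: bool
    variant_correct: list[bool]


def original_accuracy(results: list[ChartResult]) -> float:
    """Single-chart accuracy: the share of original charts answered correctly."""
    return sum(r.original_correct for r in results) / len(results)


def consistency_given_original_correct(results: list[ChartResult]) -> float:
    """Share of originally-correct charts whose variants are all answered correctly."""
    solved = [r for r in results if r.original_correct]
    if not solved:
        return 0.0
    return sum(all(r.variant_correct) for r in solved) / len(solved)


# Made-up results for four charts with three variants each.
results = [
    ChartResult(True, [True, True, True]),
    ChartResult(True, [True, False, True]),
    ChartResult(False, [False, True, False]),
    ChartResult(True, [False, False, False]),
]

single_chart = original_accuracy(results)
generalized = consistency_given_original_correct(results)
```

A large gap between the two numbers is the pattern the abstract describes: strong single-chart performance that does not carry over to the counterfactual variants.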
Problem

Research questions and friction points this paper is trying to address.

chart question-answering
visual reasoning
vision-language models
counterfactual evaluation
generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

counterfactual chart generation
visual reasoning evaluation
chart reverse engineering
vision-language models
generalizability testing