Beyond Pixels: Introspective and Interactive Grounding for Visualization Agents

📅 2026-04-22
📈 Citations: 0
Influential: 0

🤖 AI Summary
Vision-language models (VLMs) often misread numerical values, hallucinate details, and confuse overlapping elements in charts because they rely solely on pixel-level information. This work proposes Introspective and Interactive Visual Grounding (IVG), a framework that brings the chart's underlying specification into visual reasoning as deterministic evidence. IVG combines spec-grounded introspective queries with view interaction to resolve visual ambiguity. The authors also introduce iPlotBench, a benchmark designed to enable evaluation without VLM bias, comprising 500 interactive Plotly figures, 6,706 binary questions, and ground-truth specifications. Experiments show that introspection improves data reconstruction fidelity, and that combining it with interaction achieves the highest question-answering accuracy (0.81), with a +6.7% gain on overlapping geometries. IVG is further demonstrated in deployed agents that explore data autonomously and collaborate with human users in real time.

📝 Abstract
Vision-Language Models (VLMs) frequently misread values, hallucinate details, and confuse overlapping elements in charts. Current approaches rely solely on pixel interpretation, creating a Pixel-Only Bottleneck: agents treat interactive charts as static images, losing access to the structured specification that encodes exact values. We introduce Introspective and Interactive Visual Grounding (IVG), a framework that combines (1) spec-grounded introspection, which queries the underlying specification for deterministic evidence, with (2) view-grounded interaction, which manipulates the view to resolve visual ambiguity. To enable evaluation without VLM bias, we present iPlotBench, a benchmark of 500 interactive Plotly figures with 6,706 binary questions and ground-truth specifications. Experiments show that introspection improves data reconstruction fidelity, while the combination with interaction achieves the highest QA accuracy (0.81), with +6.7% gains on overlapping geometries. We further demonstrate IVG in deployed agents that explore data autonomously and collaborate with human users in real time.
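
To make the two grounding modes in the abstract concrete, here is a minimal sketch in Python. It is not the paper's implementation: the function names (`introspect_value`, `zoom_x`, `isolate_trace`) and the sample figure are hypothetical, and the figure is a plain dictionary laid out like a Plotly figure specification (a `data` list of traces plus a `layout` dictionary). The first function reads an exact value from the specification instead of estimating it from pixels. The other two change the view by setting an axis range or hiding competing traces, which is one plausible way to separate overlapping elements.

```python
import copy

# Hypothetical figure specification in Plotly's JSON layout.
# The traces and values are invented for illustration only.
FIG_SPEC = {
    "data": [
        {"type": "scatter", "name": "A", "x": [1, 2, 3], "y": [10.0, 12.5, 11.0]},
        {"type": "scatter", "name": "B", "x": [1, 2, 3], "y": [10.1, 12.4, 15.0]},
    ],
    "layout": {
        "xaxis": {"title": {"text": "step"}},
        "yaxis": {"title": {"text": "score"}},
    },
}


def introspect_value(spec, trace_name, x_value):
    """Spec-grounded introspection: return the exact y stored for x_value."""
    for trace in spec["data"]:
        if trace.get("name") == trace_name:
            idx = list(trace["x"]).index(x_value)  # ValueError if x is absent
            return trace["y"][idx]
    raise KeyError(f"no trace named {trace_name!r}")


def zoom_x(spec, lo, hi):
    """View-grounded interaction: return a copy zoomed to [lo, hi] on x."""
    new_spec = copy.deepcopy(spec)  # leave the original specification intact
    xaxis = new_spec["layout"].setdefault("xaxis", {})
    xaxis["range"] = [lo, hi]
    xaxis["autorange"] = False
    return new_spec


def isolate_trace(spec, trace_name):
    """View-grounded interaction: show one trace, send others to the legend."""
    new_spec = copy.deepcopy(spec)
    for trace in new_spec["data"]:
        trace["visible"] = True if trace.get("name") == trace_name else "legendonly"
    return new_spec


if __name__ == "__main__":
    # Traces A and B nearly overlap at x=2; the spec still gives exact values.
    print(introspect_value(FIG_SPEC, "A", 2), introspect_value(FIG_SPEC, "B", 2))
    print(zoom_x(FIG_SPEC, 1.5, 2.5)["layout"]["xaxis"])
    print([t["visible"] for t in isolate_trace(FIG_SPEC, "A")["data"]])
```

In this sketch the interaction helpers return modified copies of the specification, so an agent could re-render the changed view and inspect it again while the ground-truth specification stays unchanged.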
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
interactive charts
visual grounding
structured specification
pixel-only bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introspective Visual Grounding
Interactive Visualization
Specification-Grounded Reasoning
Vision-Language Models
iPlotBench