🤖 AI Summary
Current vision-language models (VLMs) often misread numerical values, hallucinate details, and confuse overlapping elements in charts because they rely solely on pixel-level information. This work proposes Introspective and Interactive Visual Grounding (IVG), a framework that brings the chart's underlying specification into visual reasoning as deterministic evidence. IVG combines spec-grounded introspection, which queries the specification for exact values, with view-grounded interaction, which manipulates the view to resolve visual ambiguity. The work also introduces iPlotBench, a benchmark designed to enable evaluation without VLM bias, comprising 500 interactive Plotly figures, 6,706 binary questions, and ground-truth specifications. Experiments show that introspection improves data reconstruction fidelity and that combining it with interaction achieves the highest question-answering accuracy (0.81), with a +6.7% gain on overlapping geometries. IVG is further demonstrated in deployed agents that explore data autonomously and collaborate with human users in real time.
📝 Abstract
Vision-Language Models (VLMs) frequently misread values, hallucinate details, and confuse overlapping elements in charts. Current approaches rely solely on pixel interpretation, creating a Pixel-Only Bottleneck: agents treat interactive charts as static images, losing access to the structured specification that encodes exact values. We introduce Introspective and Interactive Visual Grounding (IVG), a framework that combines (1) spec-grounded introspection, which queries the underlying specification for deterministic evidence, with (2) view-grounded interaction, which manipulates the view to resolve visual ambiguity. To enable evaluation without VLM bias, we present iPlotBench, a benchmark of 500 interactive Plotly figures with 6,706 binary questions and ground-truth specifications. Experiments show that introspection improves data reconstruction fidelity, while the combination with interaction achieves the highest QA accuracy (0.81), with +6.7% gains on overlapping geometries. We further demonstrate IVG in deployed agents that explore data autonomously and collaborate with human users in real time.
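To make the two grounding modes concrete, the following is a minimal sketch and not the authors' implementation. It assumes the chart specification is available as a plain dictionary shaped like Plotly's figure JSON (`data` traces plus `layout`). The function names `introspect_value`, `isolate_trace`, and `zoom_x`, and the sample `SPEC`, are hypothetical and chosen only for illustration.

```python
import copy

# Hypothetical specification in the shape of Plotly's figure JSON.
# Traces A and B nearly overlap, which is hard to resolve from pixels alone.
SPEC = {
    "data": [
        {"type": "scatter", "name": "A", "x": [1, 2, 3], "y": [4.0, 5.0, 6.0]},
        {"type": "scatter", "name": "B", "x": [1, 2, 3], "y": [4.1, 5.0, 5.9]},
    ],
    "layout": {"xaxis": {"title": {"text": "step"}}},
}


def introspect_value(spec, trace_name, x):
    """Spec-grounded introspection (sketch): read the exact y value from the spec.

    Raises KeyError if the trace is missing and ValueError if x is not in the trace.
    """
    for trace in spec["data"]:
        if trace.get("name") == trace_name:
            return trace["y"][trace["x"].index(x)]
    raise KeyError(trace_name)


def isolate_trace(spec, trace_name):
    """View-grounded interaction (sketch): hide every trace except one.

    Uses the trace-level `visible` attribute; the input spec is not modified.
    """
    view = copy.deepcopy(spec)
    for trace in view["data"]:
        trace["visible"] = True if trace.get("name") == trace_name else "legendonly"
    return view


def zoom_x(spec, lo, hi):
    """View-grounded interaction (sketch): restrict the x-axis to [lo, hi]."""
    view = copy.deepcopy(spec)
    view.setdefault("layout", {}).setdefault("xaxis", {})["range"] = [lo, hi]
    return view
```

For example, `introspect_value(SPEC, "B", 3)` reads the stored value for trace B directly instead of estimating it from the rendered image, while `isolate_trace(SPEC, "A")` and `zoom_x(SPEC, 1.5, 2.5)` produce modified views that an agent could re-render to separate overlapping marks. Both interaction helpers return copies, so the ground-truth specification stays unchanged.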