VisDoT: Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought

📅 2026-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large vision-language models struggle to reliably detect visual primitives in charts and to align them with semantic representations, which limits their capacity for complex visual reasoning. To address this, the paper proposes a Decomposition-of-Thought (DoT) framework that, inspired by human graphical perception theory, separates perception from logical reasoning. Chart understanding is decomposed into distinct visual perception and logical reasoning stages through four carefully designed perception tasks. Building on InternVL, the model is fine-tuned with a novel DoT prompting strategy and evaluated on a newly introduced benchmark, VisDoTQA. The method achieves an 11.2% improvement on ChartQA, outperforms GPT-4o on ChartQAPro, and yields a 33.2% gain on VisDoTQA, while also significantly improving zero-shot performance across multiple open-domain VQA tasks.

📝 Abstract
Large vision-language models (LVLMs) struggle to reliably detect visual primitives in charts and align them with semantic representations, which severely limits their performance on complex visual reasoning. This lack of perceptual grounding constitutes a major bottleneck for chart-based reasoning. We propose VisDoT, a framework that enhances visual reasoning through human-like interpretation grounding. We formalize four perceptual tasks based on the theory of graphical perception, including position and length. Building on this foundation, we introduce Decomposition-of-Thought (DoT) prompting, which sequentially separates questions into visual perception sub-questions and logic sub-questions. Fine-tuning InternVL with VisDoT achieves a +11.2% improvement on ChartQA and surpasses GPT-4o on the more challenging ChartQAPro benchmark. On the newly introduced VisDoTQA benchmark, the model improves by +33.2%. Furthermore, consistent zero-shot gains on diverse open-domain VQA benchmarks confirm the generalizability of the perception-logic separation strategy for visual question answering. VisDoT leverages human-like perception to enhance visual grounding, achieving state-of-the-art chart understanding and interpretable visual reasoning.
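The perception-logic separation described in the abstract can be sketched as a two-stage prompting loop: perception sub-questions are first grounded against the chart, and a logic sub-question is then answered over the extracted facts alone. The sketch below is a minimal, hypothetical illustration; all function names, prompt formats, and the toy chart are assumptions for exposition, not the paper's actual interface or data.

```python
# Hypothetical sketch of Decomposition-of-Thought (DoT) prompting:
# stage 1 answers visual-perception sub-questions against the chart,
# stage 2 reasons over those extracted facts only.
from typing import Callable, List


def dot_answer(question: str,
               perception_subqs: List[str],
               ask_model: Callable[[str], str]) -> str:
    """Two-stage DoT pipeline: perceive first, then reason."""
    # Stage 1: visual perception -- ground each sub-question in the chart.
    facts = [f"{q} -> {ask_model(q)}" for q in perception_subqs]
    # Stage 2: logical reasoning over the perception results only.
    logic_prompt = ("Given these chart facts:\n" + "\n".join(facts) +
                    f"\nAnswer: {question}")
    return ask_model(logic_prompt)


# Toy stand-in for an LVLM, keyed on a hard-coded bar chart
# (a real system would call a fine-tuned model on the chart image).
def toy_model(prompt: str) -> str:
    chart = {"bar A height": "30", "bar B height": "45"}
    for key, value in chart.items():
        if key in prompt and "chart facts" not in prompt:
            return value
    return "B"  # the logic stage picks the taller bar


answer = dot_answer(
    "Which bar is taller?",
    ["bar A height", "bar B height"],
    toy_model,
)
print(answer)  # prints "B"
```

The point of the decomposition is that the logic stage never sees the image, only the perception facts, which is what makes the reasoning chain inspectable and the perceptual grounding testable in isolation.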
Problem

Research questions and friction points this paper is trying to address.

visual reasoning
perceptual grounding
chart understanding
vision-language models
visual primitives
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual grounding
Decomposition-of-Thought
graphical perception
chart reasoning
vision-language models
Eunsoo Lee
Department of Computer Science and Artificial Intelligence, Dongguk University
Jeongwoo Lee
Department of Electronics and Electrical Engineering, Dongguk University
Minki Hong
Korea Advanced Institute of Science and Technology
Human-Computer Interaction · Computational Interaction
Jangho Choi
Department of Computer Science and Artificial Intelligence, Dongguk University
Jihie Kim
Dongguk University
Artificial Intelligence · Computer Education · Human-Computer Interaction · NLP