VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought

📅 2026-04-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the scarcity of scalable, region-aligned multi-step reasoning data for vision-language models, a gap that makes it hard to verify whether a model's reasoning is actually grounded in visual evidence. The authors propose a fully automated three-stage pipeline: state-of-the-art object detection and OCR models extract visual evidence, GPT-4o generates chain-of-thought rationales explicitly linked to image regions, and a rationale-driven open-set detection step refines the grounding. The result is the VG-CoT dataset, in which each reasoning step is tied to a region of the image. They also introduce a benchmark that evaluates three dimensions: Rationale Quality, Answer Accuracy, and Reasoning-Answer Alignment. Experiments with LLaVA-1.5 and Qwen2-VL show consistent improvements on most evaluation metrics, supporting the approach's effectiveness and scalability.

📝 Abstract

The advancement of Large Vision-Language Models (LVLMs) requires precise local region-based reasoning that faithfully grounds the model's logic in actual visual evidence. However, existing datasets face limitations in scalability due to extensive manual annotation and lack explicit alignment between multi-step reasoning and the corresponding image regions, which constrains the evaluation of model trustworthiness. To address these challenges, we propose the Visual Grounding Chain-of-Thought (VG-CoT) dataset, which explicitly links each reasoning step to real visual evidence within the image through a fully automated three-stage pipeline. The pipeline first extracts object- and text-level visual evidence using state-of-the-art detection and OCR models, then generates step-by-step grounded reasoning with GPT-4o, and finally refines the grounding through a rationale-driven open-set detection process. In addition, we introduce a new benchmark that comprehensively evaluates LVLM reasoning across three complementary dimensions: Rationale Quality, Answer Accuracy, and Reasoning-Answer Alignment. Experiments with representative LVLMs, including LLaVA-1.5 and Qwen2-VL, demonstrate consistent improvements on most evaluation metrics, confirming that VG-CoT effectively enhances trustworthy, evidence-based reasoning while maintaining scalable and cost-efficient dataset construction. The dataset and code will be released publicly upon acceptance to facilitate further research.
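
The three-stage pipeline described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the `Evidence` and `ReasoningStep` records, the `build_sample` function, the `(x_min, y_min, x_max, y_max)` box format, and all `fake_*` stand-ins are hypothetical names chosen for illustration. In a real pipeline the stand-ins would presumably be replaced by an object detector, an OCR model, a GPT-4o call, and an open-set detector.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Evidence:
    label: str   # object class name or recognized text string
    box: Box     # image region the evidence comes from
    source: str  # "detector" or "ocr"


@dataclass
class ReasoningStep:
    text: str                                    # one step of the rationale
    evidence: List[Evidence] = field(default_factory=list)


def build_sample(
    image: Any,
    question: str,
    detect: Callable[[Any], List[Evidence]],
    read_text: Callable[[Any], List[Evidence]],
    generate_steps: Callable[[str, List[Evidence]], List[ReasoningStep]],
    refine_box: Callable[[Any, str, Evidence], Box],
) -> List[ReasoningStep]:
    # Stage 1: collect object-level and text-level visual evidence.
    evidence = detect(image) + read_text(image)
    # Stage 2: generate step-by-step reasoning that cites that evidence.
    steps = generate_steps(question, evidence)
    # Stage 3: re-localize each cited region, conditioned on the step text.
    for step in steps:
        step.evidence = [
            Evidence(e.label, refine_box(image, step.text, e), e.source)
            for e in step.evidence
        ]
    return steps


# Toy stand-ins (hypothetical) so the sketch runs without any model.
def fake_detect(image: Any) -> List[Evidence]:
    return [Evidence("sign", (10, 10, 50, 40), "detector")]


def fake_ocr(image: Any) -> List[Evidence]:
    return [Evidence("STOP", (15, 18, 45, 32), "ocr")]


def fake_generate(question: str, evidence: List[Evidence]) -> List[ReasoningStep]:
    # One reasoning step per piece of evidence, each citing its own region.
    return [ReasoningStep(f"The {e.source} found '{e.label}'.", [e]) for e in evidence]


def fake_refine(image: Any, step_text: str, e: Evidence) -> Box:
    return e.box  # identity refinement: keeps the original box


steps = build_sample(
    None, "What does the sign say?",
    fake_detect, fake_ocr, fake_generate, fake_refine,
)
```

Passing the stages in as callables keeps the sketch independent of any particular detector, OCR engine, or language model, and every returned step carries the regions it relies on, which is the property the dataset is meant to guarantee.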

Problem

Research questions and friction points this paper is trying to address.

visual reasoning

trustworthiness

grounding

chain-of-thought

vision-language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Grounding

Chain-of-Thought Reasoning

Large Vision-Language Models