VGR: Visual Grounded Reasoning

📅 2025-06-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing multimodal chain-of-thought methods reason almost entirely in language space, which introduces language bias and limits performance on tasks that require fine-grained visual understanding. To address this, the paper proposes Visual Grounded Reasoning (VGR), a multimodal large language model that follows a two-stage “detect-and-replay” approach: it first selects bounding boxes around image regions relevant to the question, then replays those regions into the reasoning process, coupling visual perception with linguistic inference. The authors construct a large-scale dataset, VGR-SFT, whose reasoning data mixes visual grounding with language deduction. Built on the LLaVA-NeXT-7B baseline and evaluated on MMStar, AI2D, and ChartQA, VGR improves scores by +4.1, +7.1, and +12.9 respectively, while using only 30% of the baseline's image token count.

📝 Abstract

In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly reason in pure language space, which inherently suffers from language bias and is largely confined to math or science domains. This narrow focus limits their ability to handle complex visual reasoning tasks that demand comprehensive understanding of image details. To address these limitations, this paper introduces VGR, a novel reasoning multimodal large language model (MLLM) with enhanced fine-grained visual perception capabilities. Unlike traditional MLLMs that answer questions or reason solely in language space, VGR first detects relevant regions that may help to solve the problem, and then provides precise answers based on the replayed image regions. To achieve this, we construct a large-scale SFT dataset called VGR-SFT that contains reasoning data with mixed vision grounding and language deduction. The inference pipeline of VGR allows the model to choose bounding boxes for visual reference, and a replay stage integrates the corresponding regions into the reasoning process, enhancing multimodal comprehension. Experiments on the LLaVA-NeXT-7B baseline show that VGR achieves superior performance on multimodal benchmarks requiring comprehensive image detail understanding. Compared to the baseline, VGR uses only 30% of the image token count while delivering improvements of +4.1 on MMStar, +7.1 on AI2D, and +12.9 on ChartQA.
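The detect-then-replay loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `<box>x1,y1,x2,y2</box>` syntax, the helper names, the nested-list image, and the rule that a step without a box ends the loop are all assumptions made for illustration. In a real MLLM the cropped region would be re-encoded into visual tokens; here the crop itself is appended to the context as a stand-in.

```python
import re
from typing import Callable, List, Tuple

Box = Tuple[int, ...]       # (x1, y1, x2, y2) in pixels, x2/y2 exclusive
Image = List[List[int]]     # toy grayscale image: rows of pixel values

# Hypothetical box syntax; the format VGR actually emits is not given here.
BOX_PATTERN = re.compile(
    r"<box>\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*</box>"
)


def parse_boxes(text: str) -> List[Box]:
    """Extract every box written into a reasoning step."""
    return [tuple(int(v) for v in m.groups()) for m in BOX_PATTERN.finditer(text)]


def crop(image: Image, box: Box) -> Image:
    """Cut out the region, clamping the box to the image bounds."""
    height, width = len(image), len(image[0])
    x1, y1, x2, y2 = box
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    return [row[x1:x2] for row in image[y1:y2]]


def reason_with_replay(
    image: Image,
    question: str,
    generate_step: Callable[[list], str],  # hypothetical stand-in for the model
    max_steps: int = 4,
) -> list:
    """Alternate between text generation and replaying requested regions."""
    context: list = [question]
    for _ in range(max_steps):
        step = generate_step(context)
        context.append(step)
        boxes = parse_boxes(step)
        if not boxes:
            break  # assumption: a step with no box is the final answer
        for box in boxes:
            # Replay: feed the selected region back into the context.
            context.append(crop(image, box))
    return context
```

A stub model is enough to exercise the loop: pass a `generate_step` function that returns a string containing a box on its first call and a plain answer on the second, and the returned context will hold the question, the first step, the cropped region, and the answer in that order.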

Problem

Research questions and friction points this paper is trying to address.

Addresses language bias in multimodal reasoning by integrating visual grounding

Enhances fine-grained visual perception for complex image understanding tasks

Improves reasoning accuracy with selective region detection and replay

Innovation

Methods, ideas, or system contributions that make the work stand out.

Detects relevant image regions for reasoning

Integrates visual regions into reasoning process

Uses fine-grained visual perception capabilities
