Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark

📅 2025-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) generate only final answers on visual reasoning tasks, lacking interpretable intermediate reasoning steps and fine-grained visual grounding (e.g., pixel- or coordinate-level evidence). Method: We introduce the Visual Reasoning Tracer (VRT) task, which requires models to explicitly predict an object-level reasoning trajectory from initial observation to target localization. To support this, we construct VRT-Bench, a human-annotated, object-grounded evaluation benchmark, and VRT-80k, a large-scale training dataset. We propose three evaluation dimensions: trajectory completeness, grounding fidelity, and logical consistency. Building on existing MLLM architectures, we jointly model visual localization and stepwise reasoning via supervised learning on human-annotated reasoning chains. Results: Models trained on VRT-80k show substantially improved reasoning-path tracing. Our study is the first to systematically expose a critical deficiency of existing MLLMs in intermediate, visually grounded reasoning, highlighting the necessity of explicit, object-level reasoning traces for robust visual understanding.
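The evaluation dimensions above can be illustrated with a small scoring sketch. Everything here is an assumption for illustration, not the paper's actual metric: the `score_trace` function, the `(label, box)` trace format, the in-order matching rule, and the 0.5 IoU threshold are all hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def score_trace(pred, gold, thr=0.5):
    """Score a predicted reasoning trace against a gold trace (hypothetical metric).

    Each trace is a list of (object_label, box) steps ending at the target.
    Returns (completeness, fidelity): the fraction of gold steps matched
    in order, and the mean IoU over the matched steps.
    """
    matched_ious, j = [], 0
    for g_label, g_box in gold:
        # Greedily look for the next in-order predicted step that matches
        # this gold step by label and sufficient box overlap.
        for k in range(j, len(pred)):
            p_label, p_box = pred[k]
            if p_label == g_label and iou(p_box, g_box) >= thr:
                matched_ious.append(iou(p_box, g_box))
                j = k + 1
                break
    completeness = len(matched_ious) / len(gold) if gold else 1.0
    fidelity = sum(matched_ious) / len(matched_ious) if matched_ious else 0.0
    return completeness, fidelity
```

For example, a prediction that grounds only the final target but none of the intermediate objects would score high on final localization yet low on completeness, which is the gap the benchmark is designed to surface.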

📝 Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved performance on tasks such as visual grounding and visual question answering. However, the reasoning processes of these models remain largely opaque; they typically output only final predictions without revealing the intermediate steps or fine-grained evidence (e.g., pixels, locations) that lead to the result. This contrasts with human intelligence, which naturally operates through a chain of visual reasoning. To address this limitation, we introduce the Visual Reasoning Tracer (VRT) task, which requires models to not only localize the target object but also explicitly predict the intermediate objects that form the reasoning path. To advance research in this area, we contribute: (1) VRT-Bench, a human-annotated benchmark for evaluating visual reasoning; (2) a new metric for assessing the quality of reasoning traces; and (3) VRT-80k, a large-scale dataset for reasoning model training. Our experiments reveal that while existing models often produce the correct final output, they struggle to ground their intermediate reasoning. In contrast, models trained on VRT-80k achieve substantial improvements in tracing the reasoning path.
Problem

Research questions and friction points this paper is trying to address.

MLLMs lack transparent reasoning steps and fine-grained evidence
Need to localize target objects and predict intermediate reasoning paths
Existing models often produce correct final outputs yet fail to ground their intermediate reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Requires models to localize target and intermediate objects
Introduces a benchmark, metric, and dataset for evaluation
Trains models on a large-scale dataset to improve reasoning traces