Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology

📅 2025-07-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Visual grounded reasoning currently lacks a comprehensive benchmark that evaluates fine-grained object perception, traceable evidence support (e.g., bounding boxes), and higher-order spatial/interactive reasoning. To address this gap, we propose TreeBench, the first traceability-aware visual reasoning benchmark, comprising 405 challenging visual question-answer pairs sampled from SA-1B and manually annotated by eight large multimodal model (LMM) experts. We further introduce TreeVGR, a training paradigm built on Qwen2.5-VL-7B that uses reinforcement learning to jointly supervise object localization and reasoning-path generation. Experiments reveal that no state-of-the-art model reaches 60% accuracy on TreeBench (e.g., OpenAI-o3 scores 54.87%), while TreeVGR yields substantial gains: +13.4 on TreeBench, +16.8 on V*-Bench, and +12.6 on MME-RealWorld. This work establishes, for the first time, the critical role of traceability in advancing robust visual reasoning.

📝 Abstract
Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, much as humans "think with images". However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box evaluation, and (3) second-order reasoning that tests object interactions and spatial hierarchies beyond simple object localization. Prioritizing images with dense objects, we initially sample 1K high-quality images from SA-1B and incorporate eight LMM experts to manually annotate questions, candidate options, and answers for each image. After three stages of quality control, TreeBench consists of 405 challenging visual question-answer pairs. Even the most advanced models struggle with this benchmark: none reaches 60% accuracy, and OpenAI-o3, for example, scores only 54.87%. Furthermore, we introduce TreeVGR (Traceable Evidence Enhanced Visual Grounded Reasoning), a training paradigm that jointly supervises localization and reasoning with reinforcement learning, enabling accurate localization and explainable reasoning pathways. Initialized from Qwen2.5-VL-7B, it improves V* Bench (+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4), demonstrating that traceability is key to advancing vision-grounded reasoning. The code is available at https://github.com/Haochen-Wang409/TreeVGR.
Problem

Research questions and friction points this paper is trying to address.

Lack of benchmark for holistic visual reasoning evaluation
Need for traceable evidence in visual grounded reasoning
Challenges in complex scene perception and reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diagnostic benchmark for visual reasoning evaluation
Reinforcement learning for joint localization and reasoning
Traceable evidence via bounding box evaluation
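The bounding-box evaluation of traceable evidence is typically grounded in an Intersection-over-Union (IoU) comparison between a model's predicted evidence box and the annotated ground truth. The sketch below is illustrative only; the paper's exact scoring protocol and any acceptance threshold are not specified here and are assumptions.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union = sum of areas minus the overlap
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def evidence_traceable(pred_box, gt_box, threshold=0.5):
    """Count predicted evidence as traceable if IoU clears a (hypothetical) threshold."""
    return iou(pred_box, gt_box) >= threshold
```

A joint training signal, as in TreeVGR's reinforcement learning setup, could then combine such a grounding score with answer correctness, rewarding responses that are both right and verifiably grounded.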