Look As You Think: Unifying Reasoning and Visual Evidence Attribution for Verifiable Document RAG via Reinforcement Learning

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing VD-RAG methods lack fine-grained visual evidence attribution and traceable reasoning processes, rendering predictions unverifiable. To address this, we propose the Chain-of-Evidence paradigm and the Look As You Think reinforcement learning framework, which jointly integrate chain-of-thought reasoning with visual evidence localization. At each reasoning step, our method dynamically binds textual reasoning units to image regions—represented by bounding boxes and page indices—enabling process-level self-verification. Experiments based on Qwen2.5-VL-7B-Instruct demonstrate significant improvements: +8.23% in soft exact match and +47.0% in IoU@0.5, alongside strong cross-domain generalization. Our core contribution is the first integration of fine-grained multimodal evidence attribution into progressive reasoning chains, realized via end-to-end reinforcement learning to achieve verifiable visual document question answering.
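The binding of reasoning units to image regions described above can be pictured as a simple data structure, where each step of the chain carries the regions that ground it. The class and field names below are illustrative, not taken from the paper:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class EvidenceRegion:
    """One grounded region: a page index plus a bounding box on that page."""
    page_index: int
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in page coordinates

@dataclass
class ReasoningStep:
    """One textual reasoning unit, bound to zero or more evidence regions."""
    text: str
    evidence: List[EvidenceRegion] = field(default_factory=list)

@dataclass
class ChainOfEvidence:
    """A full CoE trajectory: ordered reasoning steps plus the final answer."""
    steps: List[ReasoningStep]
    answer: str

# A one-step chain: the reasoning text cites a region on page 2.
chain = ChainOfEvidence(
    steps=[ReasoningStep(
        text="The results table reports the soft EM gain.",
        evidence=[EvidenceRegion(page_index=2, bbox=(10.0, 20.0, 200.0, 80.0))],
    )],
    answer="8.23%",
)
```

Because every step records page indices and boxes, a verifier can check each claim against the rendered document rather than only the final answer.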

📝 Abstract
Aiming to identify precise evidence sources from visual documents, visual evidence attribution for visual document retrieval-augmented generation (VD-RAG) ensures reliable and verifiable predictions from vision-language models (VLMs) in multimodal question answering. Most existing methods adopt end-to-end training to facilitate intuitive answer verification. However, they lack fine-grained supervision and progressive traceability throughout the reasoning process. In this paper, we introduce the Chain-of-Evidence (CoE) paradigm for VD-RAG. CoE unifies Chain-of-Thought (CoT) reasoning and visual evidence attribution by grounding reference elements in reasoning steps to specific regions with bounding boxes and page indexes. To enable VLMs to generate such evidence-grounded reasoning, we propose Look As You Think (LAT), a reinforcement learning framework that trains models to produce verifiable reasoning paths with consistent attribution. During training, LAT evaluates the attribution consistency of each evidence region and provides rewards only when the CoE trajectory yields correct answers, encouraging process-level self-verification. Experiments on vanilla Qwen2.5-VL-7B-Instruct with Paper- and Wiki-VISA benchmarks show that LAT consistently improves the vanilla model in both single- and multi-image settings, yielding average gains of 8.23% in soft exact match (EM) and 47.0% in IoU@0.5. Meanwhile, LAT not only outperforms the supervised fine-tuning baseline, which is trained to directly produce answers with attribution, but also exhibits stronger generalization across domains.
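The IoU@0.5 metric reported in the abstract counts a predicted evidence box as correct when its intersection-over-union with the reference box is at least 0.5. A minimal implementation of the standard IoU computation for axis-aligned boxes:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

# Half-overlapping boxes: intersection 50, union 150, so IoU = 1/3 (a miss at the 0.5 threshold).
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # → 0.333...
```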
Problem

Research questions and friction points this paper is trying to address.

Visual document RAG lacks fine-grained supervision for evidence attribution
Existing methods lack progressive traceability throughout reasoning processes
Models need to ground reasoning steps to specific visual evidence regions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies reasoning with visual evidence attribution
Uses reinforcement learning for verifiable reasoning paths
Grounds reasoning steps to specific document regions
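The abstract states that LAT "provides rewards only when the CoE trajectory yields correct answers" and also scores the attribution consistency of each evidence region. A hypothetical reward shape consistent with that description, sketched here for illustration (the paper's actual reward function is not reproduced):

```python
def lat_reward(answer_correct, attribution_ious, iou_threshold=0.5):
    """Answer-gated attribution reward (illustrative sketch).

    Returns 0 unless the final answer is correct; otherwise returns the
    fraction of predicted evidence regions whose IoU with the reference
    region clears the threshold.
    """
    if not answer_correct or not attribution_ious:
        return 0.0
    consistent = sum(iou >= iou_threshold for iou in attribution_ious)
    return consistent / len(attribution_ious)

# Correct answer with one of two regions well-grounded: partial credit.
print(lat_reward(True, [0.9, 0.2]))   # → 0.5
# Wrong answer: no reward regardless of grounding quality.
print(lat_reward(False, [0.9, 0.8]))  # → 0.0
```

Gating on answer correctness is what discourages trajectories that ground evidence well but reason to the wrong conclusion, which is the process-level self-verification the summary describes.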
Authors
Shuochen Liu, Pengfei Luo, Chao Zhang, Yuhao Chen, Haotian Zhang, Qi Liu, Xin Kou, Tong Xu, Enhong Chen (University of Science and Technology of China)