Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

📅 2026-05-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing iterative retrieval-augmented generation systems rely on parsed text, which discards the visual semantics of documents (such as layout and chart structure) and yields only coarse-grained evidence attribution. This work proposes Chain of Evidence (CoE), presented as the first retriever-agnostic framework for pixel-level visual attribution, in which a fine-tuned Qwen3-VL-8B-Instruct model reasons end to end over screenshots of retrieved documents. The model outputs bounding boxes that localize each piece of evidence, and together these boxes form an interpretable multi-hop reasoning chain. Because it operates directly on visual inputs, the approach preserves the original document's visual structure and avoids the losses inherent in format-specific parsing. On the Wiki-CoE and SlideVQA benchmarks, it significantly outperforms text-based baselines, with the largest gains on complex questions that depend on visual layout.
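To make "pixel-level evidence localization" concrete, the sketch below shows one way a localized piece of evidence could be represented and compared against a reference box. It is an illustration only: the `EvidenceBox` record, its field names, and the use of intersection-over-union (IoU) as the overlap measure are assumptions, and the summary does not state which localization metric the authors use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceBox:
    """Hypothetical record for one hop: a screenshot id plus a pixel box."""
    page_id: str
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        # Degenerate boxes (x2 <= x1 or y2 <= y1) count as zero area.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def iou(a: EvidenceBox, b: EvidenceBox) -> float:
    """Intersection-over-union of two boxes; 0.0 if they are on different pages."""
    if a.page_id != b.page_id:
        return 0.0
    ix1, iy1 = max(a.x1, b.x1), max(a.y1, b.y1)
    ix2, iy2 = min(a.x2, b.x2), min(a.y2, b.y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0
```

A multi-hop reasoning chain can then be held as an ordered list of `EvidenceBox` values, one per hop, which is enough to draw the chain back onto the original screenshots.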

📝 Abstract

Iterative Retrieval-Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi-hop questions by progressively retrieving and reasoning over external documents. However, current systems predominantly operate on parsed text, which creates two critical bottlenecks: (1) Coarse-grained attribution, where users are burdened with manually locating evidence within lengthy documents based on vague text-level citations; and (2) Visual semantic loss, where the conversion of visually rich documents (e.g., slides, PDFs with charts) into text discards spatial logic and layout cues essential for reasoning. To bridge this gap, we present Chain of Evidence (CoE), a retriever-agnostic visual attribution framework that leverages Vision-Language Models to reason directly over screenshots of retrieved document candidates. CoE eliminates format-specific parsing and outputs precise bounding boxes, visualizing the complete reasoning chain within the retrieved candidate set. We evaluate CoE on two distinct benchmarks: Wiki-CoE, a large-scale dataset of structured web pages derived from 2WikiMultiHopQA, and SlideVQA, a challenging dataset of presentation slides featuring complex diagrams and free-form layouts. Experiments demonstrate that fine-tuned Qwen3-VL-8B-Instruct achieves robust performance, significantly outperforming text-based baselines in scenarios requiring visual layout understanding, while establishing a retriever-agnostic solution for pixel-level interpretable iRAG. Our code is available at https://github.com/PeiYangLiu/CoE.git.
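The control flow the abstract describes (retrieve candidates, reason over their screenshots, emit a bounding box, repeat) can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_chain`, the `Retriever` and `Reasoner` signatures, and the `max_hops` limit are all hypothetical, and the `reason` callable merely stands in for the vision-language model. Passing the retriever as a plain callable is what keeps the loop retriever-agnostic in this sketch.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixels; assumed format
Hop = Tuple[str, Box]             # (screenshot id, evidence box)

# Hypothetical interfaces: any retriever that maps a query to screenshot ids
# will do, and the reasoner stands in for the vision-language model.
Retriever = Callable[[str], List[str]]
Reasoner = Callable[
    [str, List[str], List[Hop]],
    Tuple[Optional[Hop], Optional[str], Optional[str]],
]


def run_chain(
    question: str,
    retrieve: Retriever,
    reason: Reasoner,
    max_hops: int = 4,
) -> Tuple[Optional[str], List[Hop]]:
    """Iterate retrieve -> reason until an answer appears or hops run out."""
    chain: List[Hop] = []
    query = question
    for _ in range(max_hops):
        candidates = retrieve(query)
        # The reasoner returns (evidence hop, follow-up query, final answer);
        # any of the three may be None.
        hop, next_query, answer = reason(question, candidates, chain)
        if hop is not None:
            chain.append(hop)
        if answer is not None:
            return answer, chain
        if next_query is None:
            break
        query = next_query
    return None, chain
```

The returned `chain` is the ordered list of boxes that a user interface could overlay on the screenshots, so each hop of the answer can be inspected visually.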

Problem

Research questions and friction points this paper is trying to address.

Iterative Retrieval-Augmented Generation

Visual Attribution

Pixel-Level Evidence

Visual Semantic Loss

Coarse-grained Attribution

Innovation

Methods, ideas, or system contributions that make the work stand out.

visual attribution

retrieval-augmented generation

vision-language models