Are We on the Right Way for Assessing Document Retrieval-Augmented Generation?

πŸ“… 2025-08-05
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing document-oriented RAG evaluation benchmarks rely heavily on synthetic data and suffer from narrow coverage, failing to reflect real-world bottlenecks. To address this, we introduce Double-Bench, the first fully open-source, dynamically updatable, large-scale, multilingual, and multimodal evaluation benchmark for document RAG. It spans six languages, four categories of authentic documents (e.g., financial reports, legal contracts), over 70,000 manually annotated pages, and multi-hop queries, enabling fine-grained assessment across the retrieval, evidence localization, and generation stages. Extensive experiments across nine embedding models, four multimodal large language models (MLLMs), and four RAG frameworks uncover two critical failure modes: evidence-agnostic answering and model overconfidence. Notably, we observe a significant narrowing of the performance gap between visual and textual embeddings. Double-Bench establishes a reproducible, scalable, and comprehensive evaluation infrastructure to advance robust, real-world document RAG systems.

πŸ“ Abstract
Retrieval-Augmented Generation (RAG) systems using Multimodal Large Language Models (MLLMs) show great promise for complex document understanding, yet their development is critically hampered by inadequate evaluation. Current benchmarks often focus on a specific part of the document RAG system and use synthetic data with incomplete ground truth and evidence labels, and therefore fail to reflect real-world bottlenecks and challenges. To overcome these limitations, we introduce Double-Bench: a new large-scale, multilingual, and multimodal evaluation system that produces a fine-grained assessment of each component within document RAG systems. It comprises 3,276 documents (72,880 pages) and 5,168 single- and multi-hop queries across 6 languages and 4 document types, with streamlined dynamic update support to address potential data contamination issues. Queries are grounded in exhaustively scanned evidence pages and verified by human experts to ensure maximum quality and completeness. Our comprehensive experiments across 9 state-of-the-art embedding models, 4 MLLMs, and 4 end-to-end document RAG frameworks demonstrate that the gap between text and visual embedding models is narrowing, highlighting the need to build stronger document retrieval models. Our findings also reveal the overconfidence dilemma within current document RAG frameworks, which tend to provide an answer even without evidence support. We hope our fully open-source Double-Bench provides a rigorous foundation for future research on advanced document RAG systems. We plan to retrieve timely corpora and release new benchmarks on an annual basis.
Problem

Research questions and friction points this paper is trying to address.

Inadequate evaluation hampers Retrieval-Augmented Generation (RAG) system development.
Current benchmarks fail to reflect real-world document RAG challenges.
Need for fine-grained, multilingual, multimodal assessment in document RAG systems.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Double-Bench for fine-grained RAG evaluation.
Uses a multilingual, multimodal, large-scale document dataset.
Assesses embedding models and RAG frameworks comprehensively.
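The per-stage assessment described above (retrieval, evidence localization, and generation scored separately) can be sketched as follows. This is an illustrative example, not the Double-Bench implementation; the function names, exact-match answer scoring, and page-ID format are assumptions for the sketch.

```python
# Hypothetical sketch of fine-grained, per-stage document RAG evaluation:
# each stage gets its own score instead of a single end-to-end accuracy.

def retrieval_recall(retrieved_pages, gold_evidence_pages, k=5):
    """Fraction of annotated evidence pages found among the top-k retrieved pages."""
    top_k = set(retrieved_pages[:k])
    return sum(1 for p in gold_evidence_pages if p in top_k) / len(gold_evidence_pages)

def evidence_localization_precision(cited_pages, gold_evidence_pages):
    """Precision of the pages the generator actually cites as evidence."""
    if not cited_pages:
        return 0.0
    gold = set(gold_evidence_pages)
    return sum(1 for p in cited_pages if p in gold) / len(cited_pages)

def answer_correct(prediction, gold_answer):
    """Exact-match answer scoring; real benchmarks often use LLM judges instead."""
    return prediction.strip().lower() == gold_answer.strip().lower()

# Example: a query with two annotated evidence pages, only one retrieved.
retrieved = [12, 7, 44, 3, 90]
gold_pages = [7, 15]
print(retrieval_recall(retrieved, gold_pages, k=5))        # 0.5
print(evidence_localization_precision([7, 44], gold_pages))  # 0.5
print(answer_correct("Q3 2023", "q3 2023"))                # True
```

Scoring the stages independently is what exposes failure modes like evidence-agnostic answering: a system can produce the right answer even when both retrieval and localization scores are near zero.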
Wenxuan Shen
South China University of Technology
Mingjia Wang
Huazhong University of Science and Technology
Yaochen Wang
Huazhong University of Science and Technology
Dongping Chen
University of Maryland
Junjie Yang
South China University of Technology
Yao Wan
Huazhong University of Science and Technology
NLP, Programming Languages, Software Engineering, Large Language Models
Weiwei Lin
School of Physics, Southeast University
Condensed matter physics, material science, nanotechnology, magnetism, spintronics