RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems

📅 2024-06-25
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
Existing RAG systems lack unified, interpretable evaluation criteria and large-scale annotated benchmarks. Method: The authors introduce RAGBench, a comprehensive RAG benchmark of 100k examples spanning five industry-specific domains and multiple RAG task types, sourced from industry corpora such as user manuals. They also formalize TRACe, a framework of explainable, actionable RAG evaluation metrics applicable across all RAG domains. Contribution/Results: Extensive benchmarking shows that LLM-based RAG evaluation methods struggle to compete with a fine-tuned RoBERTa model on the RAG evaluation task, and the analysis identifies where existing approaches fall short. The labeled dataset is publicly released, providing a reproducible foundation for advancing RAG evaluation in research and production deployments.

📝 Abstract
Retrieval-Augmented Generation (RAG) has become a standard architectural pattern for incorporating domain-specific knowledge into user-facing chat applications powered by Large Language Models (LLMs). RAG systems are characterized by (1) a document retriever that queries a domain-specific corpus for context information relevant to an input query, and (2) an LLM that generates a response based on the provided query and context. However, comprehensive evaluation of RAG systems remains a challenge due to the lack of unified evaluation criteria and annotated datasets. In response, we introduce RAGBench: the first comprehensive, large-scale RAG benchmark dataset of 100k examples. It covers five unique industry-specific domains and various RAG task types. RAGBench examples are sourced from industry corpora such as user manuals, making it particularly relevant for industry applications. Further, we formalize the TRACe evaluation framework: a set of explainable and actionable RAG evaluation metrics applicable across all RAG domains. We release the labeled dataset at https://huggingface.co/datasets/rungalileo/ragbench. RAGBench explainable labels facilitate holistic evaluation of RAG systems, enabling actionable feedback for continuous improvement of production applications. Through extensive benchmarking, we find that LLM-based RAG evaluation methods struggle to compete with a finetuned RoBERTa model on the RAG evaluation task. We identify areas where existing approaches fall short and propose the adoption of RAGBench with TRACe towards advancing the state of RAG evaluation systems.
Problem

Research questions and friction points this paper is trying to address.

RAG Systems
Evaluation Criteria
Test Dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

RAGBench
TRACe
RAG Evaluation
Robert Friel
Galileo Technologies Inc.
Masha Belyi
Galileo Technologies Inc.
Atindriyo Sanyal
Galileo Technologies Inc.