LiveRAG: A diverse Q&A dataset with varying difficulty level for RAG evaluation

📅 2025-11-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the lack of publicly available, reproducible evaluation benchmarks for Retrieval-Augmented Generation (RAG) question-answering systems, this paper introduces the LiveRAG benchmark, a synthetic dataset grounded in Item Response Theory (IRT). It comprises 895 QA pairs spanning a wide range of difficulty and discriminability levels. Methodologically, the benchmark builds on real system responses collected during the SIGIR'2025 LiveRAG Challenge and applies an IRT model to quantify each question's difficulty and discriminability; each question is accompanied by a ground-truth answer and the supporting claims used to evaluate competitors' answers. Its key contribution is the adoption of psychometric modeling for RAG evaluation, improving assessment granularity and the ability to differentiate between system capabilities. The authors' analysis highlights the diversity of the benchmark's questions and the wide range of their difficulty levels, supporting systematic evaluation and the development of more robust Q&A systems.
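The paper derives per-question difficulty and discriminability by fitting an IRT model to competitors' responses. As a rough illustration of the idea (not the authors' actual fitting procedure), the sketch below computes classical-test-theory proxies for the two quantities from a hypothetical binary response matrix: difficulty as the logit of the failure rate, and discriminability as the point-biserial correlation between item responses and total scores. The `responses` matrix is invented for demonstration.

```python
import math

# Hypothetical response matrix: rows = RAG systems, columns = questions
# (1 = answer judged correct, 0 = incorrect). Invented for illustration.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]

def item_stats(resp):
    """Return (difficulty, discriminability) proxies for each question."""
    n_sys = len(resp)
    n_items = len(resp[0])
    eps = 1e-6

    # Per-system total scores, used as a crude ability estimate.
    totals = [sum(row) for row in resp]
    mean_t = sum(totals) / n_sys
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in totals) / n_sys)

    stats = []
    for j in range(n_items):
        col = [resp[i][j] for i in range(n_sys)]
        p = sum(col) / n_sys  # proportion of systems answering correctly

        # Difficulty proxy: logit of the failure rate
        # (harder questions -> fewer correct answers -> larger value).
        difficulty = math.log((1 - p + eps) / (p + eps))

        # Discriminability proxy: point-biserial correlation between
        # this item's responses and the systems' total scores.
        sd_c = math.sqrt(p * (1 - p)) or eps
        cov = sum((col[i] - p) * (totals[i] - mean_t)
                  for i in range(n_sys)) / n_sys
        discriminability = cov / (sd_c * sd_t) if sd_t > 0 else 0.0

        stats.append((difficulty, discriminability))
    return stats
```

A full IRT analysis would instead jointly estimate system abilities and item parameters (e.g. a 2PL model), but these proxies capture the same intuition: questions few systems answer are hard, and questions whose outcomes track overall system quality discriminate well.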

📝 Abstract
With Retrieval Augmented Generation (RAG) becoming more and more prominent in generative AI solutions, there is an emerging need for systematically evaluating their effectiveness. We introduce the LiveRAG benchmark, a publicly available dataset of 895 synthetic questions and answers designed to support systematic evaluation of RAG-based Q&A systems. This synthetic benchmark is derived from the one used during the SIGIR'2025 LiveRAG Challenge, where competitors were evaluated under strict time constraints. It is augmented with information that was not made available to competitors during the Challenge, such as the ground-truth answers, together with their associated supporting claims which were used for evaluating competitors' answers. In addition, each question is associated with estimated difficulty and discriminability scores, derived from applying an Item Response Theory model to competitors' responses. Our analysis highlights the diversity of the benchmark's questions, the wide range of their difficulty levels, and their usefulness in differentiating between system capabilities. The LiveRAG benchmark will hopefully help the community advance RAG research, conduct systematic evaluation, and develop more robust Q&A systems.
Problem

Research questions and friction points this paper is trying to address.

Systematically evaluating effectiveness of Retrieval Augmented Generation systems
Providing diverse questions with varying difficulty levels for RAG evaluation
Developing robust Q&A systems through comprehensive benchmark testing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic dataset for RAG evaluation
Questions with difficulty and discriminability scores
Supporting claims for answer validation