RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework

📅 2024-08-02
🏛️ arXiv.org
📈 Citations: 10
Influential: 0
📄 PDF
🤖 AI Summary
Evaluating Retrieval-Augmented Generation (RAG) systems in domain-specific scenarios remains challenging due to high annotation costs, the lack of fine-grained evaluation metrics, and the difficulty of constructing representative test data. Method: The paper proposes a schema-driven framework for automated evaluation-dataset construction. A schema-based pipeline generates high-quality document–question–answer–reference quadruples, and three fine-grained metrics target factual accuracy: Completeness, Hallucination, and Irrelevance. Evaluation combines LLM-based automatic scoring with human calibration to ensure high inter-annotator agreement (Cohen’s κ > 0.85). Contribution/Results: Experiments show that the generated samples significantly outperform zero-shot and one-shot baselines in clarity, safety, conformity, and richness. The framework establishes a new paradigm for scenario-aware RAG evaluation, enabling scalable, reliable, and interpretable assessment.

📝 Abstract
Retrieval-Augmented Generation (RAG) is a powerful approach that enables large language models (LLMs) to incorporate external knowledge. However, evaluating the effectiveness of RAG systems in specialized scenarios remains challenging due to the high costs of data construction and the lack of suitable evaluation metrics. This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios by generating high-quality documents, questions, answers, and references through a schema-based pipeline. With a focus on factual accuracy, we propose three novel metrics, Completeness, Hallucination, and Irrelevance, to rigorously evaluate LLM-generated responses. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples. Furthermore, the use of LLMs for scoring the proposed metrics demonstrates a high level of consistency with human evaluations. RAGEval establishes a new paradigm for evaluating RAG systems in real-world applications. The code and dataset are released at https://github.com/OpenBMB/RAGEval.
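The pipeline's output unit is a document–question–answer–reference quadruple. A minimal sketch of such a record as a Python dataclass; the field names and example values here are hypothetical, not the actual RAGEval schema:

```python
from dataclasses import dataclass, field

@dataclass
class RAGEvalSample:
    """One generated evaluation sample (hypothetical structure)."""
    document: str                # synthetic scenario document generated from a schema
    question: str                # question grounded in the document
    answer: str                  # reference (ground-truth) answer
    references: list[str] = field(default_factory=list)  # supporting passages

# Illustrative example with made-up content.
sample = RAGEvalSample(
    document="ACME Corp's 2023 annual report states revenue of $12M ...",
    question="What revenue did ACME Corp report for 2023?",
    answer="ACME Corp reported revenue of $12M in 2023.",
    references=["ACME Corp's 2023 annual report states revenue of $12M"],
)
```

Keeping the reference passages alongside each question–answer pair is what allows the fine-grained factual metrics below to be scored against grounded evidence rather than the full document.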
Problem

Research questions and friction points this paper is trying to address.

Evaluating RAG systems in specialized, domain-specific scenarios is challenging.
Constructing representative evaluation data for such scenarios is costly.
Existing metrics do not rigorously assess the factual accuracy of LLM-generated responses.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Schema-based pipeline generates evaluation documents.
Novel metrics: Completeness, Hallucination, Irrelevance.
LLMs score metrics with high human consistency.
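One plausible way to operationalize the three metrics is over ground-truth key points, where each key point is either covered (stated correctly in the answer), contradicted (stated incorrectly), or neither. This is a hedged sketch with hypothetical function and variable names; the paper's exact formulas may differ:

```python
def factual_scores(keypoints: list[str],
                   covered: set[str],
                   contradicted: set[str]) -> dict[str, float]:
    """Score an answer against ground-truth key points (hypothetical scheme).

    completeness: fraction of key points the answer states correctly.
    hallucination: fraction of key points the answer contradicts.
    irrelevance: fraction of key points the answer neither covers nor contradicts.
    """
    n = len(keypoints)
    comp = sum(k in covered for k in keypoints) / n
    hall = sum(k in contradicted for k in keypoints) / n
    irrel = 1.0 - comp - hall
    return {"completeness": comp, "hallucination": hall, "irrelevance": irrel}

# Illustrative scoring of an answer that covers two key points
# and contradicts one.
scores = factual_scores(
    keypoints=["revenue_12M", "year_2023", "segment_growth"],
    covered={"revenue_12M", "year_2023"},
    contradicted={"segment_growth"},
)
```

In practice the covered/contradicted judgments would come from an LLM judge, which the paper calibrates against human annotators; the arithmetic above only shows how the three scores partition the key points.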