RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework

📅 2024-08-02
🏛️ arXiv.org
📈 Citations: 10
Influential: 0
📄 PDF
🤖 AI Summary
Evaluating Retrieval-Augmented Generation (RAG) systems in domain-specific scenarios remains challenging due to high annotation costs, the lack of fine-grained evaluation metrics, and the difficulty of constructing representative test data. Method: The paper proposes a schema-driven framework for automated evaluation-dataset construction. A schema-based pipeline generates high-quality document–question–answer–reference quadruples, and three fine-grained metrics target factual accuracy: Completeness, Hallucination, and Irrelevance. Evaluation combines LLM-based automatic scoring with human calibration to ensure high inter-annotator agreement (Cohen’s κ > 0.85). Contribution/Results: Experiments show that the generated samples significantly outperform zero-shot and one-shot baselines in clarity, safety, conformity, and richness. The framework establishes a new paradigm for scenario-aware RAG evaluation, enabling scalable, reliable, and interpretable assessment.

📝 Abstract
Retrieval-Augmented Generation (RAG) is a powerful approach that enables large language models (LLMs) to incorporate external knowledge. However, evaluating the effectiveness of RAG systems in specialized scenarios remains challenging due to the high costs of data construction and the lack of suitable evaluation metrics. This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios by generating high-quality documents, questions, answers, and references through a schema-based pipeline. With a focus on factual accuracy, we propose three novel metrics, Completeness, Hallucination, and Irrelevance, to rigorously evaluate LLM-generated responses. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples. Furthermore, the use of LLMs for scoring the proposed metrics demonstrates a high level of consistency with human evaluations. RAGEval establishes a new paradigm for evaluating RAG systems in real-world applications. The code and dataset are released at https://github.com/OpenBMB/RAGEval.
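The pipeline's output unit is a document–question–answer–reference quadruple. A minimal sketch of such a record as a Python dataclass; the field names and example values here are hypothetical, not the actual RAGEval schema:

```python
from dataclasses import dataclass, field

@dataclass
class RAGEvalSample:
    """One generated evaluation sample (hypothetical structure)."""
    document: str                # synthetic scenario document generated from a schema
    question: str                # question grounded in the document
    answer: str                  # reference (ground-truth) answer
    references: list[str] = field(default_factory=list)  # supporting passages

# Illustrative example with made-up content.
sample = RAGEvalSample(
    document="ACME Corp's 2023 annual report states revenue of $12M ...",
    question="What revenue did ACME Corp report for 2023?",
    answer="ACME Corp reported revenue of $12M in 2023.",
    references=["ACME Corp's 2023 annual report states revenue of $12M"],
)
```

Keeping the reference passages alongside each question–answer pair is what allows the fine-grained factual metrics below to be scored against grounded evidence rather than the full document.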
Problem

Research questions and friction points this paper is trying to address.

Evaluating RAG systems in specialized, domain-specific scenarios is challenging.
Constructing representative evaluation data for such scenarios is costly.
Existing metrics do not rigorously assess the factual accuracy of LLM-generated responses.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Schema-based pipeline generates evaluation documents.
Novel metrics: Completeness, Hallucination, Irrelevance.
LLMs score metrics with high human consistency.
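One plausible way to operationalize the three metrics is over ground-truth key points, where each key point is either covered (stated correctly in the answer), contradicted (stated incorrectly), or neither. This is a hedged sketch with hypothetical function and variable names; the paper's exact formulas may differ:

```python
def factual_scores(keypoints: list[str],
                   covered: set[str],
                   contradicted: set[str]) -> dict[str, float]:
    """Score an answer against ground-truth key points (hypothetical scheme).

    completeness: fraction of key points the answer states correctly.
    hallucination: fraction of key points the answer contradicts.
    irrelevance: fraction of key points the answer neither covers nor contradicts.
    """
    n = len(keypoints)
    comp = sum(k in covered for k in keypoints) / n
    hall = sum(k in contradicted for k in keypoints) / n
    irrel = 1.0 - comp - hall
    return {"completeness": comp, "hallucination": hall, "irrelevance": irrel}

# Illustrative scoring of an answer that covers two key points
# and contradicts one.
scores = factual_scores(
    keypoints=["revenue_12M", "year_2023", "segment_growth"],
    covered={"revenue_12M", "year_2023"},
    contradicted={"segment_growth"},
)
```

In practice the covered/contradicted judgments would come from an LLM judge, which the paper calibrates against human annotators; the arithmetic above only shows how the three scores partition the key points.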