🤖 AI Summary
Evaluating Retrieval-Augmented Generation (RAG) systems in domain-specific scenarios remains challenging due to high annotation costs, the lack of fine-grained evaluation metrics, and the difficulty of constructing representative test data.

Method: The paper proposes a schema-driven framework for automatically constructing evaluation datasets. A schema-based pipeline generates high-quality document–question–answer–reference quadruples, and three fine-grained metrics targeting factual accuracy are introduced: Completeness, Hallucination, and Irrelevance. Evaluation combines LLM-based automatic scoring with human calibration to ensure high inter-annotator agreement (Cohen’s κ > 0.85).

Contribution/Results: Experiments show that the generated samples significantly outperform zero-shot and one-shot baselines in clarity, safety, consistency, and semantic richness. The framework establishes a new paradigm for scenario-aware RAG evaluation, enabling scalable, reliable, and interpretable assessment.
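The agreement figure above (Cohen’s κ > 0.85) measures how often LLM-assigned scores match human labels beyond what chance would produce. A minimal sketch of that computation, with illustrative label names that are not taken from the paper:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: inter-rater agreement corrected for chance.

    kappa = (p_observed - p_expected) / (1 - p_expected), where
    p_expected comes from each rater's marginal label frequencies.
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both raters labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected (chance) agreement from the two raters' label distributions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical calibration set: human labels vs. LLM-judge labels.
human = ["complete", "complete", "hallucinated", "irrelevant", "complete"]
llm   = ["complete", "complete", "hallucinated", "complete",   "complete"]
print(round(cohens_kappa(human, llm), 3))  # → 0.583
```

In a real calibration run, κ would be computed per metric (Completeness, Hallucination, Irrelevance) over a much larger annotated sample; values above 0.8 are conventionally read as near-perfect agreement.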
📝 Abstract
Retrieval-Augmented Generation (RAG) is a powerful approach that enables large language models (LLMs) to incorporate external knowledge. However, evaluating the effectiveness of RAG systems in specialized scenarios remains challenging due to the high costs of data construction and the lack of suitable evaluation metrics. This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios by generating high-quality documents, questions, answers, and references through a schema-based pipeline. With a focus on factual accuracy, we propose three novel metrics, Completeness, Hallucination, and Irrelevance, to rigorously evaluate LLM-generated responses. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples. Furthermore, the use of LLMs for scoring the proposed metrics demonstrates a high level of consistency with human evaluations. RAGEval establishes a new paradigm for evaluating RAG systems in real-world applications. The code and dataset are released at https://github.com/OpenBMB/RAGEval.
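The pipeline described above emits document–question–answer–reference quadruples. A minimal sketch of what one such evaluation sample might look like as a data structure; the field names are assumptions for illustration, not the framework's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class RAGEvalSample:
    """One hypothetical evaluation quadruple: the schema-generated document,
    a question grounded in it, the gold answer, and the reference passages
    the answer must be supported by."""
    document: str
    question: str
    answer: str
    references: list[str] = field(default_factory=list)

# Illustrative content only; not drawn from the released dataset.
sample = RAGEvalSample(
    document="The 2023 annual report states that revenue grew 12%.",
    question="By how much did revenue grow in 2023?",
    answer="Revenue grew by 12%.",
    references=["revenue grew 12%"],
)
print(sample.question)
```

Keeping the references attached to each sample is what makes the fine-grained metrics checkable: Completeness and Hallucination compare the system's response against the reference passages rather than against the whole document.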