🤖 AI Summary
Evaluating retrieval-augmented generation (RAG) systems in the large language model (LLM) era poses unique challenges, including hybrid retrieval-plus-generation architectures, dynamic knowledge sources, and multidimensional trustworthiness requirements.
Method: We bridge traditional and LLM-native evaluation paradigms for the first time, proposing a four-dimensional taxonomy covering performance, factual accuracy, safety, and computational efficiency. Our meta-analysis synthesizes insights from 120+ high-impact studies. The evaluation methodology integrates human assessment, automated metrics (e.g., ROUGE, BERTScore), fact-checking tools, retrieval-quality measures, and end-to-end interpretability analysis. We further construct a RAG-specific benchmark dataset and a framework classification taxonomy to expose biases in current evaluation practice.
Contribution/Results: This work delivers the first standardized, multidimensional, trust-oriented evaluation guideline for RAG systems, enabling rigorous, reproducible, and responsible development and deployment of trustworthy RAG applications.
📝 Abstract
Recent advancements in Retrieval-Augmented Generation (RAG) have revolutionized natural language processing by integrating Large Language Models (LLMs) with external information retrieval, enabling accurate, up-to-date, and verifiable text generation across diverse applications. However, evaluating RAG systems presents unique challenges due to their hybrid architecture, which combines retrieval and generation components, and their dependence on dynamic knowledge sources in the LLM era. In response, this paper provides a comprehensive survey of RAG evaluation methods and frameworks, systematically reviewing traditional and emerging approaches to evaluating system performance, factual accuracy, safety, and computational efficiency in the LLM era. We also compile and categorize RAG-specific datasets and evaluation frameworks, conducting a meta-analysis of evaluation practices in high-impact RAG research. To the best of our knowledge, this work represents the most comprehensive survey of RAG evaluation, bridging traditional and LLM-driven methods, and serves as a critical resource for advancing RAG development.