🤖 AI Summary
Evaluating retrieval-augmented generation (RAG) systems in the large language model (LLM) era poses unique challenges, including hybrid retrieval-plus-generation architectures, dynamic knowledge sources, and multidimensional trustworthiness requirements.
Method: We bridge traditional and LLM-native evaluation paradigms for the first time, proposing a four-dimensional taxonomy covering performance, factual accuracy, safety, and computational efficiency. Our meta-analysis synthesizes insights from 120+ high-impact studies. The evaluation methodology integrates human assessment, automated metrics (e.g., ROUGE, BERTScore), fact-checking tools, retrieval-quality measures, and end-to-end interpretability analysis. We further construct a RAG-specific benchmark dataset and a framework classification taxonomy to expose biases in current evaluation practice.
Contribution/Results: This work delivers the first standardized, multidimensional, trust-oriented evaluation guideline for RAG systems, enabling rigorous, reproducible, and responsible development and deployment of trustworthy RAG applications.
📝 Abstract
Recent advancements in Retrieval-Augmented Generation (RAG) have revolutionized natural language processing by integrating Large Language Models (LLMs) with external information retrieval, enabling accurate, up-to-date, and verifiable text generation across diverse applications. However, evaluating RAG systems presents unique challenges due to their hybrid architecture, which combines retrieval and generation components, and their dependence on dynamic knowledge sources in the LLM era. In response, this paper provides a comprehensive survey of RAG evaluation methods and frameworks, systematically reviewing traditional and emerging approaches to evaluating system performance, factual accuracy, safety, and computational efficiency in the LLM era. We also compile and categorize RAG-specific datasets and evaluation frameworks, conducting a meta-analysis of evaluation practices in high-impact RAG research. To the best of our knowledge, this work represents the most comprehensive survey of RAG evaluation, bridging traditional and LLM-driven methods, and serves as a critical resource for advancing RAG development.