LONG²RAG: Evaluating Long-Context & Long-Form Retrieval-Augmented Generation with Key Point Recall

📅 2024-10-30
🏛️ Conference on Empirical Methods in Natural Language Processing
🤖 AI Summary
Problem: Existing RAG evaluation benchmarks lack systematic assessment of joint long-document retrieval and long-text generation capabilities.

Method: We introduce Long²RAG, the first benchmark tailored for long-context RAG, comprising 280 cross-domain questions and a document corpus with an average length of 2,444 words. We propose Key Point Recall (KPR), a fine-grained metric grounded in human-annotated semantic key points that evaluates depth of information utilization rather than superficial lexical matching. Our framework supports multi-document long-context settings, structured question answering, and automated KPR evaluation, and is compatible with mainstream RAG systems and both open- and closed-source LLMs.

Results: Experiments reveal that state-of-the-art RAG systems achieve less than 42% key-information recall on long documents. KPR correlates strongly with human judgments (ρ = 0.89), substantially outperforming conventional metrics such as BLEU and ROUGE.

📝 Abstract
Retrieval-augmented generation (RAG) is a promising approach to address the limitations of fixed knowledge in large language models (LLMs). However, current benchmarks for evaluating RAG systems suffer from two key deficiencies: (1) they fail to adequately measure LLMs' capability in handling long-context retrieval due to a lack of datasets that reflect the characteristics of retrieved documents, and (2) they lack a comprehensive evaluation method for assessing LLMs' ability to generate long-form responses that effectively exploit retrieved information. To address these shortcomings, we introduce the Long$^2$RAG benchmark and the Key Point Recall (KPR) metric. Long$^2$RAG comprises 280 questions spanning 10 domains and 8 question categories, each associated with 5 retrieved documents with an average length of 2,444 words. KPR evaluates the extent to which LLMs incorporate key points extracted from the retrieved documents into their generated responses, providing a more nuanced assessment of their ability to exploit retrieved information.
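The recall-style computation behind KPR can be sketched as follows. This is a minimal illustration, not the paper's implementation: the paper judges key-point coverage with an LLM-based entailment check, whereas the `covers` function below substitutes a crude lexical-overlap heuristic, and all names (`kpr`, `covers`, `threshold`) are illustrative assumptions.

```python
def covers(response: str, key_point: str, threshold: float = 0.6) -> bool:
    """Stand-in judge: does the response mention most of the key point's
    content words? (The actual benchmark would use an LLM entailment check.)"""
    resp_tokens = set(response.lower().split())
    kp_tokens = [t for t in key_point.lower().split() if len(t) > 3]
    if not kp_tokens:
        return False
    hits = sum(t in resp_tokens for t in kp_tokens)
    return hits / len(kp_tokens) >= threshold

def kpr(response: str, key_points: list[str]) -> float:
    """Key Point Recall: fraction of annotated key points covered by the response."""
    if not key_points:
        return 0.0
    return sum(covers(response, kp) for kp in key_points) / len(key_points)

# Toy example with two hypothetical key points.
key_points = [
    "retrieval augmented generation mitigates outdated knowledge",
    "long documents average 2444 words",
]
response = ("Retrieval-augmented generation mitigates outdated knowledge "
            "by grounding answers in retrieved long documents.")
print(kpr(response, key_points))
```

Because KPR is a recall over document-grounded key points rather than an n-gram overlap against a single reference, it rewards responses that surface more of the retrieved evidence, which is what distinguishes it from BLEU or ROUGE in the authors' comparison.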
Problem

Research questions and friction points this paper is trying to address.

Retrieval-Augmented Generation
Long Document Retrieval
Evaluation Metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Long$^2$RAG Dataset
Key Point Recall (KPR) Scoring
Retrieval-Augmented Generation (RAG) Evaluation