🤖 AI Summary
Existing RAG evaluation methods predominantly rely on relevance-ranking metrics, which fail to capture the actual impact of retrieval on long-text generation (e.g., report writing).
Method: We propose CRUX, the first evaluation framework based on *controlled information scope*: it uses human-written summaries to precisely define the required knowledge boundary and construct a gold-standard context, then applies question-driven, fine-grained assessment that jointly evaluates relevance and coverage, quantifying context quality independently of the generation process.
Contribution/Results: Experiments demonstrate that CRUX significantly outperforms conventional metrics in diagnostic capability, accurately pinpointing retrieval-specific deficiencies in long-text generation scenarios. By providing interpretable, actionable insights, CRUX enables principled optimization of RAG retrieval components.
📝 Abstract
Retrieval-augmented generation (RAG) enhances large language models by incorporating context retrieved from external knowledge sources. While the effectiveness of the retrieval module is typically evaluated with relevance-based ranking metrics, such metrics may be insufficient to reflect the retrieval's impact on the final RAG result, especially in long-form generation scenarios. We argue that providing a comprehensive retrieval-augmented context is important for long-form RAG tasks like report generation and propose metrics for assessing the context independent of generation. We introduce CRUX, a **C**ontrolled **R**etrieval-a**U**gmented conte**X**t evaluation framework designed to directly assess retrieval-augmented contexts. This framework uses human-written summaries to control the information scope of knowledge, enabling us to measure how well the context covers information essential for long-form generation. CRUX uses question-based evaluation to assess RAG's retrieval in a fine-grained manner. Empirical results show that CRUX offers a more reflective and diagnostic evaluation. Our findings also reveal substantial room for improvement in current retrieval methods, pointing to promising directions for advancing RAG's retrieval. Our data and code are publicly available to support and advance future research on retrieval.
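The question-based evaluation described above can be sketched as a simple coverage score: given gold questions derived from human-written summaries, score a retrieved context by the fraction of questions it can answer. The function names, the question format, and the keyword-matching judge below are illustrative assumptions, not CRUX's actual implementation; in practice the answerability judge would be an LLM.

```python
# Hedged sketch of question-based context evaluation, assuming a
# hypothetical question format and a toy keyword-based judge.

def context_coverage(context, questions, judge):
    """Fraction of gold questions the retrieved context can answer."""
    if not questions:
        return 0.0
    answered = sum(1 for q in questions if judge(context, q))
    return answered / len(questions)

def keyword_judge(context, question):
    # Toy stand-in: a question counts as answered if all of its key
    # terms appear in the context (a real judge would use an LLM).
    return all(t.lower() in context.lower() for t in question["key_terms"])

questions = [
    {"text": "What year was the policy enacted?", "key_terms": ["2019", "policy"]},
    {"text": "Who led the initiative?", "key_terms": ["minister"]},
]
context = "The policy was enacted in 2019 under the health minister."
print(context_coverage(context, questions, keyword_judge))  # 1.0
```

Because the score is computed on the context alone, it isolates retrieval quality from the downstream generator, which is the independence property the framework targets.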