🤖 AI Summary
Existing evaluations of RAG systems lack comprehensive coverage of multi-source factual integration, particularly in end-to-end settings that require cross-document retrieval and synthesis. Method: We introduce FRAMES, a holistic benchmark for end-to-end RAG that jointly evaluates factual consistency, retrieval quality, and reasoning depth. It is built on a manually curated, challenging multi-hop question-answering dataset requiring cross-document evidence integration, coupled with a multi-step retrieval pipeline and fine-grained analysis of LLM responses. Contribution/Results: Experiments reveal critical limitations: state-of-the-art models achieve only 0.40 accuracy without retrieval; adding multi-step retrieval raises performance to 0.66 (+65% relative), exposing substantial bottlenecks in multi-hop reasoning and cross-source factual aggregation. FRAMES enables capability-decoupled assessment and iterative optimization of RAG systems, setting a standard for rigorous, multifaceted evaluation.
📝 Abstract
Large Language Models (LLMs) have demonstrated significant performance improvements across various cognitive tasks. An emerging application is using LLMs to enhance retrieval-augmented generation (RAG) capabilities. These systems require LLMs to understand user queries, retrieve relevant information, and synthesize coherent and accurate responses. Given the increasing real-world deployment of such systems, comprehensive evaluation becomes crucial. To this end, we propose FRAMES (Factuality, Retrieval, And reasoning MEasurement Set), a high-quality evaluation dataset designed to test LLMs' ability to provide factual responses, assess retrieval capabilities, and evaluate the reasoning required to generate final answers. While previous work has provided datasets and benchmarks to evaluate these abilities in isolation, FRAMES offers a unified framework that provides a clearer picture of LLM performance in end-to-end RAG scenarios. Our dataset comprises challenging multi-hop questions that require the integration of information from multiple sources. We present baseline results demonstrating that even state-of-the-art LLMs struggle with this task, achieving 0.40 accuracy with no retrieval. The accuracy is significantly improved with our proposed multi-step retrieval pipeline, achieving an accuracy of 0.66 (>50% improvement). We hope our work will help bridge evaluation gaps and assist in developing more robust and capable RAG systems.
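To make the idea of the multi-step retrieval pipeline concrete, here is a minimal toy sketch of iterative retrieval for a multi-hop question. This is an illustrative assumption, not the authors' implementation: the tiny `corpus`, the keyword-overlap scorer, and the `multi_step_retrieve` helper are all hypothetical stand-ins, and a real pipeline would use an LLM to rewrite the query at each hop and to synthesize the final answer from the gathered evidence.

```python
import re

# Toy two-document corpus: answering the question below requires one fact
# from each document (a multi-hop chain), mirroring FRAMES-style questions.
corpus = {
    "doc_everest": "Mount Everest is the highest mountain, located in Nepal.",
    "doc_nepal": "Nepal's capital city is Kathmandu.",
}


def retrieve(query: str, exclude: list[str] = (), k: int = 1) -> list[str]:
    """Rank candidate documents by naive keyword overlap with the query."""
    query_terms = set(re.findall(r"\w+", query.lower()))

    def score(doc_id: str) -> int:
        return len(query_terms & set(re.findall(r"\w+", corpus[doc_id].lower())))

    candidates = [d for d in corpus if d not in exclude]
    return sorted(candidates, key=score, reverse=True)[:k]


def multi_step_retrieve(question: str, steps: int = 2) -> list[str]:
    """Accumulate evidence over several retrieval rounds.

    After each round the retrieved text is folded into the next query,
    standing in for an LLM that would reformulate the query per hop.
    """
    evidence, query = [], question
    for _ in range(steps):
        for doc_id in retrieve(query, exclude=evidence):
            evidence.append(doc_id)
            query += " " + corpus[doc_id]  # expand the query with new evidence
    return evidence


question = "What is the capital of the country where the highest mountain is located?"
print(multi_step_retrieve(question))  # ['doc_everest', 'doc_nepal']
```

A single-shot retriever matches only the first document here (the question shares no useful terms with the Kathmandu document); the second hop succeeds only because the first hop's evidence introduces "Nepal" into the query. This is the failure mode the benchmark's multi-hop questions are designed to expose.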