MIRAGE: A Metric-Intensive Benchmark for Retrieval-Augmented Generation Evaluation

📅 2025-04-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current RAG evaluation lacks fine-grained, component-decoupled benchmarks, which hinders precise characterization of the retrieval-generation interplay. To address this, the authors introduce MIRAGE, a dedicated benchmark for adaptive RAG evaluation comprising 7,560 QA pairs mapped to 37,800 retrieved documents, enabling joint analysis of retrievers and LLM-based generators. They formally define and quantify four adaptability dimensions: noise vulnerability, context acceptability, context insensitivity, and context misinterpretation, and provide an integrated evaluation framework. Extensive experiments uncover systematic matching patterns between retrievers and LLMs and show that the metrics are sensitive to real-world system deficiencies. The dataset, code, and evaluation toolkit are publicly released to advance standardized, reproducible RAG assessment.

📝 Abstract
Retrieval-Augmented Generation (RAG) has gained prominence as an effective method for enhancing the generative capabilities of Large Language Models (LLMs) through the incorporation of external knowledge. However, the evaluation of RAG systems remains a challenge, due to the intricate interplay between retrieval and generation components. This limitation has resulted in a scarcity of benchmarks that facilitate a detailed, component-specific assessment. In this work, we present MIRAGE, a Question Answering dataset specifically designed for RAG evaluation. MIRAGE consists of 7,560 curated instances mapped to a retrieval pool of 37,800 entries, enabling an efficient and precise evaluation of both retrieval and generation tasks. We also introduce novel evaluation metrics aimed at measuring RAG adaptability, encompassing dimensions such as noise vulnerability, context acceptability, context insensitivity, and context misinterpretation. Through comprehensive experiments across various retriever-LLM configurations, we provide new insights into the optimal alignment of model pairs and the nuanced dynamics within RAG systems. The dataset and evaluation code are publicly available, allowing for seamless integration and customization in diverse research settings. The MIRAGE code and data are available at https://github.com/nlpai-lab/MIRAGE.
Problem

Research questions and friction points this paper is trying to address.

RAG system evaluation lacks comprehensive, standardized benchmarks
Measuring the retrieval-generation interplay requires specialized metrics
Existing RAG benchmarks do not support component-specific analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the MIRAGE dataset for RAG evaluation
Develops novel metrics for measuring RAG adaptability
Enables precise, component-level evaluation of retrieval and generation
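To make the four adaptability dimensions concrete, the sketch below shows one plausible way to aggregate them from per-question correctness flags. The setup (comparing a model's answer with no context, with the gold document, and with distractor documents mixed in) and the exact decision rules are illustrative assumptions, not the paper's formal definitions; the `Outcome` dataclass and `adaptability_profile` function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    # Per-question correctness flags under three assumed conditions:
    base: bool    # correct with no retrieved context
    oracle: bool  # correct when given the gold (relevant) document
    noisy: bool   # correct when distractor documents are mixed in

def adaptability_profile(outcomes):
    """Return the four MIRAGE-style dimensions as dataset fractions.

    These rules are an illustrative reading of the dimensions:
    - noise vulnerability: succeeds with clean context, derailed by noise
    - context acceptability: retrieval fixes a question the model missed
    - context insensitivity: fails even when the gold document is given
    - context misinterpretation: correct alone, but misled by the context
    """
    n = len(outcomes)
    return {
        "noise_vulnerability":
            sum(o.oracle and not o.noisy for o in outcomes) / n,
        "context_acceptability":
            sum(not o.base and o.oracle for o in outcomes) / n,
        "context_insensitivity":
            sum(not o.base and not o.oracle for o in outcomes) / n,
        "context_misinterpretation":
            sum(o.base and not o.oracle for o in outcomes) / n,
    }

# Toy sample: one question exhibiting each behavior.
sample = [
    Outcome(base=False, oracle=True,  noisy=True),   # helped by retrieval
    Outcome(base=True,  oracle=True,  noisy=False),  # derailed by noise
    Outcome(base=False, oracle=False, noisy=False),  # ignores context
    Outcome(base=True,  oracle=False, noisy=False),  # misreads context
]
print(adaptability_profile(sample))
```

Keeping each dimension as a simple fraction makes retriever-LLM pairs directly comparable: a pairing with low noise vulnerability but high context insensitivity points at the generator rather than the retriever, which is the kind of component-specific diagnosis the benchmark is built for.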