🤖 AI Summary
Existing RAG evaluation practice lacks deep, end-to-end tools for quality diagnosis and root-cause attribution—particularly tools that reconcile automated metrics, human judgment, and annotation reliability. To address this gap, we introduce the first explainable, multi-dimensional, human-in-the-loop introspection platform specifically designed for RAG systems. Our framework integrates an automated evaluation pipeline, annotation quality modeling, interactive visual analytics, and an open API. It supports both aggregate statistics and fine-grained, instance-level attribution across key dimensions—including retrieval relevance, generation faithfulness, and information completeness—while quantifying annotator reliability via rigorous inter-annotator agreement modeling. We validate the platform across multiple public RAG benchmarks under diverse scenarios. All code and the platform are fully open-sourced, facilitating standardized, reproducible RAG evaluation research.
📝 Abstract
Large Language Models (LLMs) have become a popular approach for implementing Retrieval Augmented Generation (RAG) systems, and significant effort has been spent on building good models and metrics. Despite increased recognition of the need for rigorous evaluation of RAG systems, few tools exist that go beyond generating model output and computing automatic metrics. We present InspectorRAGet, an introspection platform for performing a comprehensive analysis of the quality of RAG system output. InspectorRAGet lets users analyze the aggregate and instance-level performance of RAG systems using both human and algorithmic metrics, as well as annotator quality. InspectorRAGet is suitable for multiple use cases and is publicly available to the community. A live instance of the platform is available at https://ibm.biz/InspectorRAGet.
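The annotator-quality analysis mentioned above rests on inter-annotator agreement. As a minimal sketch, one common agreement statistic is Cohen's kappa for two annotators (the abstract does not specify which measure InspectorRAGet actually uses, so this is purely illustrative):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two annotators' labels on the same items.

    Compares observed agreement against the agreement expected by
    chance from each annotator's label distribution.
    """
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Fraction of items where both annotators chose the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from the two marginal label distributions.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Perfect agreement yields kappa = 1.0; agreement no better than
# chance yields kappa = 0.0.
print(cohens_kappa(["good", "bad", "good", "bad"],
                   ["good", "bad", "good", "bad"]))  # 1.0
```

A value near zero signals that annotators agree no more often than chance, which is exactly the kind of reliability problem an introspection platform would surface before trusting human metric scores.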