🤖 AI Summary
Existing RAG evaluation practice lacks deep, end-to-end tools for quality diagnosis and root-cause attribution—particularly tools that reconcile automated metrics, human judgment, and annotation reliability. To address this gap, we introduce the first explainable, multi-dimensional, human-in-the-loop introspection platform specifically designed for RAG systems. Our framework integrates an automated evaluation pipeline, annotation quality modeling, interactive visual analytics, and an open API. It supports both aggregate statistics and fine-grained, instance-level attribution across key dimensions—including retrieval relevance, generation faithfulness, and information completeness—while quantifying annotator reliability via rigorous inter-annotator agreement modeling. We validate the platform across multiple public RAG benchmarks under diverse scenarios. All code and the platform are fully open-sourced, facilitating standardized, reproducible RAG evaluation research.
📝 Abstract
Large Language Models (LLMs) have become a popular approach for implementing Retrieval Augmented Generation (RAG) systems, and significant effort has been spent on building good models and metrics. Despite increased recognition of the need for rigorous evaluation of RAG systems, few tools exist that go beyond generating model output and computing automatic metrics. We present InspectorRAGet, an introspection platform for performing a comprehensive analysis of the quality of RAG system output. InspectorRAGet lets users analyze the aggregate and instance-level performance of RAG systems using both human and algorithmic metrics, as well as annotator quality. InspectorRAGet is suitable for multiple use cases and is publicly available to the community. A live instance of the platform is available at https://ibm.biz/InspectorRAGet.
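The annotator-quality analysis mentioned above rests on inter-annotator agreement. As a minimal sketch, one common agreement statistic is Cohen's kappa for two annotators (the abstract does not specify which measure InspectorRAGet actually uses, so this is purely illustrative):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two annotators' labels on the same items.

    Compares observed agreement against the agreement expected by
    chance from each annotator's label distribution.
    """
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Fraction of items where both annotators chose the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from the two marginal label distributions.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Perfect agreement yields kappa = 1.0; agreement no better than
# chance yields kappa = 0.0.
print(cohens_kappa(["good", "bad", "good", "bad"],
                   ["good", "bad", "good", "bad"]))  # 1.0
```

A value near zero signals that annotators agree no more often than chance, which is exactly the kind of reliability problem an introspection platform would surface before trusting human metric scores.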