AI Summary
Addressing the challenge of end-to-end evaluation for RAG-based question-answering systems in the absence of labeled data or reference answers, this paper introduces THELMA: the first task-driven, reference-free, and multidimensionally coupled full-stack evaluation framework. THELMA defines six fine-grained, computable metrics spanning retrieval, generation, and response quality (e.g., semantic consistency, factual accuracy, completeness), integrating LLM-based self-assessment with structured reasoning analysis to uncover intrinsic metric interdependencies and precisely localize system weaknesses. Evaluated across multiple RAG benchmarks, THELMA achieves strong agreement with human judgments (average Spearman ρ > 0.89), significantly improving fault localization accuracy and iteration efficiency. It establishes a reliable, automated evaluation paradigm for continuous monitoring and optimization of production RAG systems.
Abstract
We propose THELMA (Task Based Holistic Evaluation of Large Language Model Applications), a reference-free framework for Retrieval Augmented Generation (RAG) based question answering (QA) applications. THELMA consists of six interdependent metrics specifically designed for holistic, fine-grained evaluation of RAG QA applications. The THELMA framework helps developers and application owners evaluate, monitor, and improve end-to-end RAG QA pipelines without requiring labelled sources or reference responses. We also present our findings on the interplay of the proposed THELMA metrics, which can be interpreted to identify the specific RAG component needing improvement in QA applications.
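The summary reports agreement with human judgments as a Spearman rank correlation (ρ > 0.89). As a minimal sketch of how such an agreement score could be computed between a framework's per-example metric scores and human ratings, the snippet below implements Spearman's ρ from scratch (rank the two score lists, then take the Pearson correlation of the ranks, with average ranks for ties). The `framework_scores` and `human_ratings` values are hypothetical illustrations, not data from the paper.

```python
def average_ranks(xs):
    """Return 1-based ranks of xs, assigning tied values their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        # extend j to cover the run of tied values
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical example: automatic metric scores vs. human ratings
framework_scores = [0.82, 0.91, 0.64, 0.77, 0.88]
human_ratings = [4, 5, 2, 3, 5]  # note the tie between two items
print(spearman_rho(framework_scores, human_ratings))
```

A high ρ here would indicate that the reference-free metric orders system outputs the same way human annotators do, which is the property the paper's evaluation targets.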