THELMA: Task Based Holistic Evaluation of Large Language Model Applications-RAG Question Answering

πŸ“… 2025-05-16
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Addressing the challenge of end-to-end evaluation for RAG-based question-answering systems in the absence of labeled data or reference answers, this paper introduces THELMA, a task-driven, reference-free framework for holistic, full-stack evaluation. THELMA defines six fine-grained, computable metrics spanning retrieval, generation, and response quality (e.g., semantic consistency, factual accuracy, completeness), integrating LLM-based self-assessment with structured reasoning analysis to uncover intrinsic metric interdependencies and precisely localize system weaknesses. Evaluated across multiple RAG benchmarks, THELMA achieves strong agreement with human judgments (average Spearman ρ > 0.89), improving fault-localization accuracy and iteration efficiency. It establishes a reliable, automated evaluation paradigm for continuous monitoring and optimization of production RAG systems.

πŸ“ Abstract
We propose THELMA (Task Based Holistic Evaluation of Large Language Model Applications), a reference-free framework for RAG (Retrieval Augmented Generation) based question answering (QA) applications. THELMA consists of six interdependent metrics specifically designed for holistic, fine-grained evaluation of RAG QA applications. The THELMA framework helps developers and application owners evaluate, monitor, and improve end-to-end RAG QA pipelines without requiring labelled sources or reference responses. We also present our findings on the interplay of the proposed THELMA metrics, which can be interpreted to identify the specific RAG component needing improvement in QA applications.
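The abstract describes scoring a RAG QA pipeline on six interdependent metrics and using their interplay to pinpoint the component needing improvement. A minimal sketch of that evaluation loop is below; the metric names and the retrieval/generation/response grouping are illustrative assumptions (the summary mentions semantic consistency, factual accuracy, and completeness, the rest are hypothetical), and the LLM-as-judge call is a stub to be replaced with a real prompt-based self-assessment.

```python
from statistics import mean

# Hypothetical grouping of six metrics by pipeline stage. The paper defines
# six interdependent metrics; only a few names are given in the summary, so
# the others here are placeholders.
METRICS = {
    "retrieval": ["context_relevance", "context_coverage"],
    "generation": ["semantic_consistency", "factual_accuracy"],
    "response": ["completeness", "answer_relevance"],
}


def judge(metric, question, contexts, answer):
    """Stand-in for an LLM-as-judge call returning a score in [0, 1].

    In a real pipeline this would prompt an LLM to self-assess the given
    metric for the (question, retrieved contexts, answer) triple.
    """
    raise NotImplementedError


def evaluate(question, contexts, answer, judge_fn=judge):
    """Score all six metrics, average per stage, and flag the weakest stage."""
    scores = {
        m: judge_fn(m, question, contexts, answer)
        for names in METRICS.values()
        for m in names
    }
    per_stage = {
        stage: mean(scores[m] for m in names)
        for stage, names in METRICS.items()
    }
    # The stage with the lowest average score is the localization signal:
    # the component most likely needing improvement.
    weakest = min(per_stage, key=per_stage.get)
    return scores, per_stage, weakest
```

A usage pattern would be to run `evaluate` over a sample of production queries and track `per_stage` averages over time; a drop concentrated in one stage directs iteration effort to that component, without any reference answers.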
Problem

Research questions and friction points this paper is trying to address.

Evaluating RAG QA apps without reference responses
Developing holistic metrics for RAG QA assessment
Identifying RAG components needing improvement in QA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reference-free framework for RAG QA evaluation
Six metrics for holistic RAG assessment
Identifies RAG components needing improvement
πŸ”Ž Similar Papers
No similar papers found.
Udita Patel
Amazon.com
Rutu Mulkar
Amazon.com Services Inc.
Jay Roberts
Protopia AI
Cibi Chakravarthy Senthilkumar
Amazon.com Services Inc.
Sujay Gandhi
Amazon.com Services Inc.
Xiaofei Zheng
Amazon.com Services Inc.
Naumaan Nayyar
Amazon.com Services Inc.
Rafael Castrillo
Amazon.com Services Inc.