AI Summary
Addressing the challenge of end-to-end evaluation for RAG-based question-answering systems in the absence of labeled data or reference answers, this paper introduces THELMA: the first task-driven, reference-free, and multidimensionally coupled full-stack evaluation framework. THELMA defines six fine-grained, computable metrics spanning retrieval, generation, and response quality (e.g., semantic consistency, factual accuracy, completeness), integrating LLM-based self-assessment with structured reasoning analysis to uncover intrinsic metric interdependencies and precisely localize system weaknesses. Evaluated across multiple RAG benchmarks, THELMA achieves strong agreement with human judgments (average Spearman ρ > 0.89), significantly improving fault localization accuracy and iteration efficiency. It establishes a reliable, automated evaluation paradigm for continuous monitoring and optimization of production RAG systems.
Abstract
We propose THELMA (Task Based Holistic Evaluation of Large Language Model Applications), a reference-free framework for Retrieval Augmented Generation (RAG) based question answering (QA) applications. THELMA consists of six interdependent metrics specifically designed for holistic, fine-grained evaluation of RAG QA applications. The THELMA framework helps developers and application owners evaluate, monitor, and improve end-to-end RAG QA pipelines without requiring labelled sources or reference responses. We also present our findings on the interplay of the proposed THELMA metrics, which can be interpreted to identify the specific RAG component needing improvement in QA applications.
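The summary reports agreement with human judgments as a Spearman rank correlation (ρ > 0.89). As a minimal sketch of how such an agreement score could be computed between a framework's per-example metric scores and human ratings, the snippet below implements Spearman's ρ from scratch (rank the two score lists, then take the Pearson correlation of the ranks, with average ranks for ties). The `framework_scores` and `human_ratings` values are hypothetical illustrations, not data from the paper.

```python
def average_ranks(xs):
    """Return 1-based ranks of xs, assigning tied values their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        # extend j to cover the run of tied values
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical example: automatic metric scores vs. human ratings
framework_scores = [0.82, 0.91, 0.64, 0.77, 0.88]
human_ratings = [4, 5, 2, 3, 5]  # note the tie between two items
print(spearman_rho(framework_scores, human_ratings))
```

A high ρ here would indicate that the reference-free metric orders system outputs the same way human annotators do, which is the property the paper's evaluation targets.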