Quality Assurance for LLM-RAG Systems: Empirical Insights from Tourism Application Testing

📅 2025-02-09
🤖 AI Summary
This paper addresses the challenge of evaluating LLM-RAG system quality in real-world tourism recommendation scenarios. Methodologically, it introduces the first RAG quality assurance framework tailored for production-grade tourism systems, featuring a 17-item evaluation metric suite spanning syntactic, semantic, and behavioral dimensions. Leveraging an LLM-judged multi-dimensional assessment paradigm, the study conducts systematic empirical evaluations across three large language models under varied RAG architectures and hyperparameters (e.g., temperature, top-p). Key findings reveal a nonlinear impact of temperature and top-p on response quality; notably, newer model versions significantly increase response length and structural complexity but yield only marginal gains in semantic fidelity. The work delivers a reproducible, empirically grounded methodology and practical guidelines for quality testing of operational LLM-RAG systems.

📝 Abstract
This paper presents a comprehensive framework for testing and evaluating quality characteristics of Large Language Model (LLM) systems enhanced with Retrieval-Augmented Generation (RAG) in tourism applications. Through systematic empirical evaluation of three different LLM variants across multiple parameter configurations, we demonstrate the effectiveness of our testing methodology in assessing both functional correctness and extra-functional properties. Our framework implements 17 distinct metrics that encompass syntactic analysis, semantic evaluation, and behavioral evaluation through LLM judges. The study reveals significant information about how different architectural choices and parameter configurations affect system performance, particularly highlighting the impact of temperature and top-p parameters on response quality. The tests were carried out on a tourism recommendation system for the Värmland region, utilizing standard and RAG-enhanced configurations. The results indicate that the newer LLM versions show modest improvements in performance metrics, though the differences are more pronounced in response length and complexity rather than in semantic quality. The research contributes practical insights for implementing robust testing practices in LLM-RAG systems, providing valuable guidance to organizations deploying these architectures in production environments.
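As a rough illustration of what evaluating "across multiple parameter configurations" can look like in practice, the sketch below sweeps a small temperature/top-p grid and averages a few simple syntactic metrics per configuration. It is not the paper's framework: the three metrics, the function names `syntactic_metrics` and `sweep`, and the stub `fake_generate` (standing in for a real LLM-RAG call) are all assumptions made for this example.

```python
import re
from itertools import product
from statistics import mean


def syntactic_metrics(text: str) -> dict:
    """Compute three simple syntactic metrics for one response (illustrative choice)."""
    words = re.findall(r"\w+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
    }


def sweep(generate, prompts, temperatures, top_ps):
    """Run every (temperature, top_p) pair and average each metric over the prompts."""
    results = {}
    for temperature, top_p in product(temperatures, top_ps):
        rows = [syntactic_metrics(generate(p, temperature, top_p)) for p in prompts]
        results[(temperature, top_p)] = {
            name: mean(row[name] for row in rows) for name in rows[0]
        }
    return results


def fake_generate(prompt: str, temperature: float, top_p: float) -> str:
    """Hypothetical stand-in for an LLM-RAG call; it only varies response length."""
    repeats = 1 + int(temperature * 2)
    return " ".join(["Visit Karlstad by the lake."] * repeats)


if __name__ == "__main__":
    grid = sweep(fake_generate, ["What should I see in Värmland?"], [0.0, 1.0], [0.9])
    for config, metrics in grid.items():
        print(config, metrics)
```

In a real setup, `fake_generate` would be replaced by a call to the system under test, and semantic or LLM-judged metrics would be added alongside the syntactic ones.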
Problem

Research questions and friction points this paper is trying to address.

Testing quality characteristics of LLM-RAG systems
Assessing functional correctness and extra-functional properties
Impact of architectural choices on system performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-RAG system testing framework
17 metrics for evaluation
Tourism application empirical study
Authors
Bestoun S. Ahmed
Professor in Computer Science, Karlstad University
Software Testing, Software Engineering, SE4AI, MLOps
Ludwig Otto Baader
Dept. of Mathematics, Informatics and Statistics, Ludwig Maximilian University Munich, Munich, Germany
Firas Bayram
Research Assistant, University of Luxembourg
Concept Drift, Machine Learning, Sports Analytics
Siri Jagstedt
CTF, Service Research Center, Karlstad University, Karlstad, Sweden
Peter Magnusson
CTF, Service Research Center, Karlstad University, Karlstad, Sweden