Data Quality Challenges in Retrieval-Augmented Generation

📅 2025-10-01
🤖 AI Summary
Problem: Existing data quality (DQ) frameworks are designed for static datasets and do not accommodate the dynamic, multi-stage pipeline of retrieval-augmented generation (RAG) systems. Method: Drawing on 16 semi-structured interviews with IT service practitioners and a qualitative content analysis, we derive 15 stage-specific DQ dimensions spanning the four RAG processing stages (data extraction, data transformation, prompt & search, and generation) and show how DQ issues transform and propagate along the pipeline. Contribution/Results: We propose a front-loaded, stage-aware approach to dynamic DQ management that delineates where conventional DQ frameworks still apply and where new, RAG-specific dimensions are needed, most of which fall in the early stages. This provides an actionable foundation for DQ assessment and improvement in enterprise-scale, RAG-based knowledge applications.

📝 Abstract
Organizations increasingly adopt Retrieval-Augmented Generation (RAG) to enhance Large Language Models with enterprise-specific knowledge. However, current data quality (DQ) frameworks were developed primarily for static datasets and do not adequately address the dynamic, multi-stage nature of RAG systems. This study aims to develop DQ dimensions for this new type of AI-based system. We conduct 16 semi-structured interviews with practitioners from leading IT service companies. Through a qualitative content analysis, we inductively derive 15 distinct DQ dimensions across the four processing stages of RAG systems: data extraction, data transformation, prompt & search, and generation. Our findings reveal that (1) new dimensions have to be added to traditional DQ frameworks to cover RAG contexts; (2) these new dimensions are concentrated in the early RAG stages, suggesting the need for front-loaded quality management strategies; and (3) DQ issues transform and propagate through the RAG pipeline, necessitating a dynamic, stage-aware approach to quality management.
Problem

Research questions and friction points this paper is trying to address.

Developing data quality dimensions for dynamic RAG systems
Addressing inadequacies in traditional static data quality frameworks
Analyzing quality issue propagation across multi-stage AI pipelines
Innovation

Methods, ideas, or system contributions that make the work stand out.

Develops 15 data quality dimensions for RAG systems
Identifies quality dimensions across four RAG processing stages
Proposes dynamic quality management for RAG pipelines
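
The stage-aware idea in the points above can be illustrated with a minimal Python sketch. It is not the authors' framework: only the four stage names come from the abstract, while the `Issue`, `Artifact`, `run_stage`, and `run_pipeline` names, the two example checks, and the 200-character limit are hypothetical and do not correspond to any of the paper's 15 dimensions. The sketch only shows how issues detected in an early stage can be carried forward so that later stages see them.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# The four processing stages named in the abstract.
STAGES = ("data extraction", "data transformation", "prompt & search", "generation")


@dataclass
class Issue:
    stage: str       # stage in which the issue was first detected
    dimension: str   # hypothetical label, NOT one of the paper's 15 dimensions
    detail: str


@dataclass
class Artifact:
    """Data moving through the pipeline, plus every issue found so far."""
    payload: str
    issues: list = field(default_factory=list)


# A check inspects the payload and returns a problem description, or None if it passes.
Check = Callable[[str], Optional[str]]


def run_stage(stage: str, artifact: Artifact, checks: dict) -> Artifact:
    """Apply one stage's checks; issues from earlier stages are carried forward."""
    new_issues = []
    for dimension, check in checks.items():
        problem = check(artifact.payload)
        if problem is not None:
            new_issues.append(Issue(stage, dimension, problem))
    return Artifact(artifact.payload, artifact.issues + new_issues)


def run_pipeline(payload: str, checks_by_stage: dict) -> Artifact:
    """Run the stages in order so that early issues remain visible downstream."""
    artifact = Artifact(payload)
    for stage in STAGES:
        artifact = run_stage(stage, artifact, checks_by_stage.get(stage, {}))
    return artifact


# Hypothetical example checks, concentrated in the early stages (front-loaded).
EXAMPLE_CHECKS = {
    "data extraction": {
        "non_empty_text": lambda text: None if text.strip() else "extracted text is empty",
    },
    "data transformation": {
        "chunk_length": lambda text: "chunk exceeds 200 characters" if len(text) > 200 else None,
    },
}

if __name__ == "__main__":
    result = run_pipeline("   ", EXAMPLE_CHECKS)
    for issue in result.issues:
        print(f"[{issue.stage}] {issue.dimension}: {issue.detail}")
```

Keeping the issue list attached to the data, rather than to a single stage, is one simple way to make propagation explicit; a real system would also need stage-specific transformations of the payload, which this sketch leaves out.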
Leopold Müller
University of Bayreuth, Bayreuth, Germany
Joshua Holstein
Karlsruhe Institute of Technology, Karlsruhe, Germany
Sarah Bause
Karlsruhe Institute of Technology, Karlsruhe, Germany
Gerhard Satzger
Karlsruhe Service Research Institute / Institute of Information Systems and Marketing, KIT
Digital Services, Service Analytics, Service Innovation, Human-AI-Collaboration
Niklas Kühl
University of Bayreuth
Artificial Intelligence, Human-AI-Teams, Fairness in Machine Learning, Appropriate Reliance