Data Quality Challenges in Retrieval-Augmented Generation

📅 2025-10-01

🤖 AI Summary

Problem: Existing data quality (DQ) frameworks were designed for static datasets and do not accommodate the dynamic, multi-stage pipeline of retrieval-augmented generation (RAG) systems. Method: Drawing on 16 semi-structured interviews with practitioners from IT service companies and a qualitative content analysis, we inductively derive 15 DQ dimensions spanning the four RAG processing stages (data extraction, data transformation, prompt & search, and generation) and show how DQ issues transform and propagate across the pipeline. Contribution/Results: The findings indicate that traditional DQ frameworks need additional dimensions to cover RAG contexts, that these new dimensions are concentrated in the early stages, which calls for front-loaded quality management, and that quality management must be dynamic and step-aware. Together, the dimensions offer a starting point for assessing and improving DQ in enterprise RAG-based knowledge applications.

📝 Abstract

Organizations increasingly adopt Retrieval-Augmented Generation (RAG) to enhance Large Language Models with enterprise-specific knowledge. However, current data quality (DQ) frameworks have been developed primarily for static datasets and inadequately address the dynamic, multi-stage nature of RAG systems. This study aims to develop DQ dimensions for this new type of AI-based system. We conduct 16 semi-structured interviews with practitioners from leading IT service companies. Through a qualitative content analysis, we inductively derive 15 distinct DQ dimensions across the four processing stages of RAG systems: data extraction, data transformation, prompt & search, and generation. Our findings reveal that (1) new dimensions must be added to traditional DQ frameworks to cover RAG contexts; (2) these new dimensions are concentrated in the early RAG steps, suggesting the need for front-loaded quality management strategies; and (3) DQ issues transform and propagate through the RAG pipeline, necessitating a dynamic, step-aware approach to quality management.

Problem

Research questions and friction points this paper is trying to address.

Developing data quality dimensions for dynamic RAG systems

Addressing inadequacies in traditional static data quality frameworks

Analyzing quality issue propagation across multi-stage AI pipelines

Innovation

Methods, ideas, or system contributions that make the work stand out.

Develops 15 data quality dimensions for RAG systems

Maps quality dimensions to the four RAG processing stages: data extraction, data transformation, prompt & search, and generation

Proposes dynamic quality management for RAG pipelines
