🤖 AI Summary
Existing data quality (DQ) frameworks were designed for static datasets and do not accommodate the dynamic, multi-stage pipeline of retrieval-augmented generation (RAG) systems. Method: Drawing on 16 semi-structured interviews with practitioners from IT service companies and a qualitative content analysis, the authors inductively derive 15 DQ dimensions across the four RAG processing stages: data extraction, data transformation, prompt & search, and generation. Contribution/Results: The findings show that traditional DQ frameworks need additional dimensions to cover RAG contexts, that these new dimensions are concentrated in the early RAG steps, which calls for front-loaded quality management, and that DQ issues transform and propagate through the pipeline, which calls for a dynamic, step-aware approach to quality management.
📝 Abstract
Organizations increasingly adopt Retrieval-Augmented Generation (RAG) to enhance Large Language Models with enterprise-specific knowledge. However, current data quality (DQ) frameworks were developed primarily for static datasets and inadequately address the dynamic, multi-stage nature of RAG systems. This study aims to develop DQ dimensions for this new type of AI-based system. We conduct 16 semi-structured interviews with practitioners from leading IT service companies. Through a qualitative content analysis, we inductively derive 15 distinct DQ dimensions across the four processing stages of RAG systems: data extraction, data transformation, prompt & search, and generation. Our findings reveal that (1) new dimensions must be added to traditional DQ frameworks to cover RAG contexts; (2) these new dimensions are concentrated in early RAG steps, suggesting the need for front-loaded quality management strategies; and (3) DQ issues transform and propagate through the RAG pipeline, necessitating a dynamic, step-aware approach to quality management.