kRAIG: A Natural Language-Driven Agent for Automated DataOps Pipeline Generation

📅 2026-03-19
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Automated construction of executable ELT pipelines faces significant challenges, including ambiguous user intent, unreliable tool generation, and a lack of execution guarantees in outputs. This work proposes kRAIG, a natural language–driven AI agent that leverages a novel ReQuesAct interaction framework to explicitly clarify user intent. By integrating retrieval-augmented component composition with a multi-stage LLM-based verification mechanism, kRAIG automatically translates high-level instructions into production-grade Kubeflow pipelines. Experimental results demonstrate that kRAIG achieves a threefold improvement in data extraction and loading success rates and a 25% increase in transformation accuracy compared to existing approaches, substantially enhancing the reliability, executability, and practicality of automated data engineering pipelines.

๐Ÿ“ Abstract
Modern machine learning systems rely on complex data engineering workflows to extract, load, and transform (ELT) data into production pipelines. However, constructing these pipelines remains time-consuming and requires substantial expertise in data infrastructure and orchestration frameworks. Recent advances in large language model (LLM) agents offer a potential path toward automating these workflows, but existing approaches struggle with under-specified user intent, unreliable tool generation, and limited guarantees of executable outputs. We introduce kRAIG, an AI agent that translates natural language specifications into production-ready Kubeflow Pipelines (KFP). To resolve ambiguity in user intent, we propose ReQuesAct (Reason, Question, Act), an interaction framework that explicitly clarifies intent prior to pipeline synthesis. The system orchestrates end-to-end data movement from diverse sources and generates task-specific transformation components through a retrieval-augmented tool synthesis process. To ensure data quality and safety, kRAIG incorporates LLM-based validation stages that verify pipeline integrity prior to execution. Our framework achieves a 3x improvement in extraction and loading success and a 25 percent increase in transformation accuracy compared to state-of-the-art agentic baselines. These improvements demonstrate that structured agent workflows with explicit intent clarification and validation significantly enhance the reliability and executability of automated data engineering pipelines.
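The Reason, Question, Act loop described in the abstract can be illustrated with a minimal Python sketch. This is not kRAIG's implementation: the names `PipelineSpec`, `REQUIRED_FIELDS`, `reques_act`, and `build_plan` are hypothetical, the reasoning step is reduced to a check for missing fields rather than an LLM call, and the output is a plain list of step labels instead of a Kubeflow pipeline.

```python
from dataclasses import dataclass, field

# Hypothetical set of details a request must pin down before synthesis;
# the paper does not specify this list.
REQUIRED_FIELDS = ("source", "destination", "transformation")


@dataclass
class PipelineSpec:
    """What is known so far about the user's intent."""
    fields: dict = field(default_factory=dict)

    def missing(self):
        # Reason: find required details the request has not specified.
        return [name for name in REQUIRED_FIELDS if not self.fields.get(name)]


def build_plan(spec):
    # Act: stand-in for pipeline synthesis, emitting steps in ELT order.
    f = spec.fields
    return [
        f"extract:{f['source']}",
        f"load:{f['destination']}",
        f"transform:{f['transformation']}",
    ]


def reques_act(spec, ask, act, max_questions=5):
    """Clarify intent one question at a time, then act on the complete spec."""
    asked = 0
    while True:
        gaps = spec.missing()                      # Reason
        if not gaps:
            return act(spec)                       # Act
        if asked >= max_questions:
            raise ValueError(f"intent still ambiguous: {gaps}")
        name = gaps[0]
        spec.fields[name] = ask(f"Please specify the {name}.")  # Question
        asked += 1
```

In this sketch, `ask` would be wired to the user (or a chat interface) and `act` to the real synthesis step. The point of the structure is that nothing is generated until every gap is closed, and the loop gives up with an error rather than guessing when the question budget runs out.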
Problem

Research questions and friction points this paper is trying to address.

DataOps
pipeline generation
natural language
LLM agents
executable workflows
Innovation

Methods, ideas, or system contributions that make the work stand out.

ReQuesAct
retrieval-augmented tool synthesis
LLM-based validation
natural language to pipeline
automated DataOps