Prompt2DAG: A Modular Methodology for LLM-Based Data Enrichment Pipeline Generation

📅 2025-09-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Building data pipelines has a high barrier to entry and depends heavily on engineering expertise. This paper proposes Prompt2DAG, a hybrid generation approach that turns natural language specifications into executable Apache Airflow DAGs. It combines large language models (LLMs) for semantic understanding, a structured template engine that enforces deterministic syntax, and multi-stage validation, balancing expressiveness against reliability. Generation quality is measured with a penalized scoring framework covering code quality (SAT), structural integrity (DST), and executability (PCT). Across 260 experiments, the Hybrid approach reaches a 78.5% success rate, compared with 66.2% for LLM-only and 29.2% for Direct prompting, and is more than twice as cost-efficient as Direct prompting per successful DAG. The authors present it as a practical path toward democratizing data pipeline development.
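The combination of LLM output, templates, and validation can be illustrated with a minimal sketch. This is not the paper's implementation: the JSON spec format, the template, and the names `SPEC_JSON`, `DAG_TEMPLATE`, `validate_spec`, and `render_dag` are all hypothetical, and the JSON string stands in for what an LLM might extract from a natural language description. The generated code assumes Airflow 2.4 or later (`EmptyOperator`, the `schedule` argument), but it is only syntax-checked here, not executed.

```python
import ast
import json
from string import Template

# Hypothetical structured spec, standing in for an LLM's parsed output.
SPEC_JSON = """
{"dag_id": "enrich_customers",
 "tasks": [
   {"task_id": "load", "upstream": []},
   {"task_id": "enrich", "upstream": ["load"]},
   {"task_id": "save", "upstream": ["enrich"]}
 ]}
"""

# Deterministic template: the LLM never writes this boilerplate itself.
DAG_TEMPLATE = Template('''\
from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(dag_id="$dag_id", start_date=datetime(2025, 1, 1), schedule=None) as dag:
$task_lines
$edge_lines
''')


def validate_spec(spec: dict) -> None:
    """Structural checks: unique, identifier-safe task ids, known upstreams, no cycles."""
    ids = [t["task_id"] for t in spec["tasks"]]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate task_id")
    for tid in ids:
        if not tid.isidentifier():
            raise ValueError(f"task_id {tid!r} is not a valid Python identifier")
    indegree = {t["task_id"]: len(t["upstream"]) for t in spec["tasks"]}
    downstream = {tid: [] for tid in ids}
    for t in spec["tasks"]:
        for up in t["upstream"]:
            if up not in downstream:
                raise ValueError(f"unknown upstream task {up!r}")
            downstream[up].append(t["task_id"])
    # Kahn's algorithm: if some task is never reached, the graph has a cycle.
    ready = [tid for tid, deg in indegree.items() if deg == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for nxt in downstream[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if visited != len(ids):
        raise ValueError("task graph contains a cycle")


def render_dag(spec: dict) -> str:
    """Validate the spec, fill the template, and syntax-check the result."""
    validate_spec(spec)
    tasks = spec["tasks"]
    task_lines = "\n".join(
        f'    {t["task_id"]} = EmptyOperator(task_id="{t["task_id"]}")' for t in tasks
    )
    edge_lines = "\n".join(
        f'    {up} >> {t["task_id"]}' for t in tasks for up in t["upstream"]
    )
    code = DAG_TEMPLATE.substitute(
        dag_id=spec["dag_id"], task_lines=task_lines, edge_lines=edge_lines
    )
    ast.parse(code)  # raises SyntaxError if the rendered file is not valid Python
    return code


if __name__ == "__main__":
    print(render_dag(json.loads(SPEC_JSON)))
```

In this sketch the LLM is only trusted with the structured part (task names and dependencies); everything that must be syntactically exact comes from the template, and the validation step rejects a bad spec before any code is emitted.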

📝 Abstract

Developing reliable data enrichment pipelines demands significant engineering expertise. We present Prompt2DAG, a methodology that transforms natural language descriptions into executable Apache Airflow DAGs. We evaluate four generation approaches -- Direct, LLM-only, Hybrid, and Template-based -- across 260 experiments using thirteen LLMs and five case studies to identify optimal strategies for production-grade automation. Performance is measured using a penalized scoring framework that combines reliability with code quality (SAT), structural integrity (DST), and executability (PCT). The Hybrid approach emerges as the optimal generative method, achieving a 78.5% success rate with robust quality scores (SAT: 6.79, DST: 7.67, PCT: 7.76). This significantly outperforms the LLM-only (66.2% success) and Direct (29.2% success) methods. Our findings show that reliability, not intrinsic code quality, is the primary differentiator. Cost-effectiveness analysis reveals the Hybrid method is over twice as efficient as Direct prompting per successful DAG. We conclude that a structured, hybrid approach is essential for balancing flexibility and reliability in automated workflow generation, offering a viable path to democratize data pipeline development.
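The arithmetic behind these figures can be sketched as follows. The abstract does not give the scoring formula or any cost data, so everything here is an assumption: that the 260 experiments split evenly across the four methods (13 LLMs × 5 case studies = 65 runs each), that a failed run contributes a score of zero to a penalized mean, and that each attempt costs the same hypothetical amount. The function names are illustrative only.

```python
# Assumption: 260 experiments split evenly over 4 methods (13 LLMs x 5 case studies).
RUNS_PER_METHOD = 260 // 4  # 65

# Success rates as reported in the abstract.
SUCCESS_RATES = {"Hybrid": 0.785, "LLM-only": 0.662, "Direct": 0.292}


def successes(rate: float, runs: int = RUNS_PER_METHOD) -> int:
    """Number of successful runs implied by a reported success rate."""
    return round(rate * runs)


def penalized_mean(success_scores: list[float], attempts: int) -> float:
    """Mean quality over ALL attempts, counting each failed run as 0 (assumed penalty)."""
    return sum(success_scores) / attempts


def cost_per_success(total_cost: float, n_success: int) -> float:
    """Total spend divided by the number of usable DAGs it produced."""
    return total_cost / n_success


if __name__ == "__main__":
    for name, rate in SUCCESS_RATES.items():
        print(name, successes(rate), "of", RUNS_PER_METHOD)

    # HYPOTHETICAL: identical spend of 1.0 unit per attempt for both methods.
    spend = 1.0 * RUNS_PER_METHOD
    hybrid = cost_per_success(spend, successes(SUCCESS_RATES["Hybrid"]))
    direct = cost_per_success(spend, successes(SUCCESS_RATES["Direct"]))
    print("Direct / Hybrid cost per successful DAG:", direct / hybrid)
```

Under the equal-split assumption the reported rates correspond to 51, 43, and 19 successful runs out of 65. With equal spend per attempt, the difference in reliability alone would make Direct prompting roughly 2.7 times as expensive per successful DAG, which is in line with the "over twice" claim; the paper's actual figure depends on real per-method costs that are not given here.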

Problem

Research questions and friction points this paper is trying to address.

Automating data enrichment pipeline generation from natural language

Evaluating optimal LLM strategies for reliable workflow automation

Balancing flexibility and reliability in automated DAG creation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid method for reliable DAG generation

Natural language to executable Airflow transformation

Penalized scoring framework evaluating pipeline quality

🔎 Similar Papers

On The Role of Prompt Construction In Enhancing Efficacy and Efficiency of LLM-Based Tabular Data Generation