FlowETL: An Autonomous Example-Driven Pipeline for Data Engineering

📅 2025-07-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing ETL pipelines heavily rely on manual, context-sensitive design of transformation logic, resulting in poor generalizability and low reusability. To address this, we propose an example-driven autonomous ETL framework: given user-provided target data examples, it constructs a paired-sample-based planning engine that automatically infers and synthesizes high-fidelity, context-adapted data transformation programs. Integrated with modular ETL components and runtime monitoring, the framework enables end-to-end automation for multi-format, multi-structured, and multi-scale data processing. Experiments across 14 real-world, cross-domain datasets demonstrate that our approach substantially reduces human intervention while achieving high-precision transformations (average F1 score of 0.92), strong generalization across diverse schemas and formats, and practical engineering deployability.

📝 Abstract
The Extract, Transform, Load (ETL) workflow is fundamental for populating and maintaining data warehouses and other data stores accessed by analysts for downstream tasks. A major shortcoming of modern ETL solutions is the extensive need for a human-in-the-loop, required to design and implement context-specific and often non-generalisable transformations. While related work in the field of ETL automation shows promising progress, there is a lack of solutions capable of automatically designing and applying these transformations. We present FlowETL, a novel example-based autonomous ETL pipeline architecture designed to automatically standardise and prepare input datasets according to a concise, user-defined target dataset. FlowETL is an ecosystem of components that interact to achieve the desired outcome. A Planning Engine uses a paired input-output dataset sample to construct a transformation plan, which is then applied by an ETL worker to the source dataset. Monitoring and logging provide observability throughout the entire pipeline. The results show promising generalisation capabilities across 14 datasets of various domains, file structures, and file sizes.
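The core idea of inferring a transformation plan from a paired input-output sample can be illustrated with a toy sketch. Everything below is hypothetical (the candidate library, `infer_plan`, and the plan format are illustrative names, not FlowETL's actual API): given a few example rows of source and target data, a planner searches a small library of candidate transformations and keeps, per column, one that reproduces every target value.

```python
# Toy example-driven plan inference; a minimal sketch, not FlowETL's planner.

CANDIDATES = {
    "identity": lambda v: v,
    "strip": lambda v: v.strip(),
    "upper": lambda v: v.upper(),
    "lower": lambda v: v.lower(),
    "to_int": lambda v: int(v),
}

def infer_plan(paired_sample):
    """For each column, pick the first candidate transform that maps
    every sample input value to its paired output value."""
    plan = {}
    columns = paired_sample[0][0].keys()
    for col in columns:
        for name, fn in CANDIDATES.items():
            try:
                if all(fn(src[col]) == dst[col] for src, dst in paired_sample):
                    plan[col] = name
                    break
            except (ValueError, AttributeError):
                continue  # candidate does not apply to this column's values
    return plan

# Paired (source row, target row) examples supplied by the user.
sample = [
    ({"name": "  Alice ", "age": "30"}, {"name": "Alice", "age": 30}),
    ({"name": "Bob",      "age": "41"}, {"name": "Bob", "age": 41}),
]
print(infer_plan(sample))  # {'name': 'strip', 'age': 'to_int'}
```

A real planner must of course handle compositions of transforms, schema mapping, and ambiguous examples; this sketch only shows why a small paired sample can be enough to pin down simple column-level logic.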
Problem

Research questions and friction points this paper is trying to address.

Automating ETL workflows to reduce human intervention
Designing context-specific transformations without manual input
Standardizing diverse datasets using example-driven approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Example-based autonomous ETL pipeline architecture
Planning Engine constructs transformation plan automatically
Monitoring and logging ensure pipeline observability
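The worker-plus-observability side of the architecture can likewise be sketched in a few lines. The names here (`apply_plan`, the `TRANSFORMS` table) are again hypothetical: an ETL worker applies a column-to-transform plan row by row, while standard logging records failures and summary counts rather than aborting the whole pipeline.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl-worker")

# Hypothetical transform library; illustrative only.
TRANSFORMS = {
    "strip": lambda v: v.strip(),
    "to_int": lambda v: int(v),
}

def apply_plan(rows, plan):
    """Apply a column->transform plan to each row, logging and skipping
    rows that fail instead of halting the pipeline."""
    out, failures = [], 0
    for i, row in enumerate(rows):
        try:
            out.append({col: TRANSFORMS[t](row[col]) for col, t in plan.items()})
        except (ValueError, KeyError) as exc:
            failures += 1
            log.warning("row %d skipped: %r", i, exc)
    log.info("transformed %d rows, %d failures", len(out), failures)
    return out

rows = [{"name": " Ada ", "age": "36"}, {"name": "Bob", "age": "n/a"}]
print(apply_plan(rows, {"name": "strip", "age": "to_int"}))
# [{'name': 'Ada', 'age': 36}]
```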
Mattia Di Profio
University of Aberdeen, Department of Computing Science
Mingjun Zhong
Department of Computing Science, University of Aberdeen, UK
Applied Statistics, Machine Learning
Yaji Sripada
University of Aberdeen, Department of Computing Science
Marcel Jaspars
University of Aberdeen, Department of Chemistry