🤖 AI Summary
This work addresses a limitation of existing distributed data pipeline systems, which require users to define the complete workflow graph explicitly. It proposes a unified planning and scheduling framework that constructs end-to-end, permanently scheduled pipelines from a declaration of data sources, available workflow components, and desired data destinations and formats. The approach introduces WORKSWORLD, a new domain for domain-independent numeric planners, built on workflow and resource graph models that include network interfaces for scheduling. Experimental results indicate feasibility at modest scale: with one hour of CPU time and 30 GB of memory on commodity hardware, a state-of-the-art numeric planner solved linear-chain workflows of up to 14 components across eight sites.
📝 Abstract
This work pursues automated planning and scheduling of distributed data pipelines, or workflows. We develop a general workflow and resource graph representation that includes both data processing and sharing components with corresponding network interfaces for scheduling. Leveraging these graphs, we introduce WORKSWORLD, a new domain for numeric domain-independent planners designed for permanently scheduled workflows, like ingest pipelines. Our framework permits users to define data sources, available workflow components, and desired data destinations and formats without explicitly declaring the entire workflow graph as a goal. The planner solves a joint planning and scheduling problem, producing a plan that both builds the workflow graph and schedules its components on the resource graph. We empirically show that a state-of-the-art numeric planner running on commodity hardware with one hour of CPU time and 30GB of memory can solve linear-chain workflows of up to 14 components across eight sites.
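To make the joint planning and scheduling idea concrete, here is a minimal Python sketch. It is not the WORKSWORLD domain and it does not use a numeric planner: it swaps in a plain breadth-first search over a toy model. The names `Component` and `plan_pipeline`, the site labels, the data formats, and the single "CPU units" resource are all hypothetical and chosen only for illustration. The user states a source, a set of available components, and a destination with a desired format; the search then decides both which components to chain and which site runs each one.

```python
from collections import deque
from typing import NamedTuple


class Component(NamedTuple):
    """Hypothetical processing component: converts one data format to another."""
    name: str
    in_fmt: str
    out_fmt: str
    cpu: int  # assumed resource demand, in abstract CPU units


def plan_pipeline(source, goal, components, links, capacity):
    """Breadth-first search for a shortest sequence of run/send steps.

    source, goal: (site, format) pairs.
    links: directed (from_site, to_site) pairs over which data can be shared.
    capacity: free CPU units per site; running a component consumes them
              for good, mimicking a permanently scheduled workflow.
    Returns a list of step descriptions, or None if no plan exists.
    """
    sites = sorted(capacity)
    start = (source[0], source[1], tuple(capacity[s] for s in sites))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (site, fmt, free), steps = queue.popleft()
        if (site, fmt) == goal:
            return steps
        successors = []
        # Sharing action: move the current data to a linked site.
        for a, b in links:
            if a == site:
                successors.append(((b, fmt, free), f"send {fmt} {a}->{b}"))
        # Processing action: run a matching component where capacity remains.
        i = sites.index(site)
        for c in components:
            if c.in_fmt == fmt and free[i] >= c.cpu:
                new_free = free[:i] + (free[i] - c.cpu,) + free[i + 1:]
                successors.append(
                    ((site, c.out_fmt, new_free), f"run {c.name}@{site}")
                )
        for state, step in successors:
            if state not in seen:
                seen.add(state)
                queue.append((state, steps + [step]))
    return None


# Toy instance (all values are illustrative assumptions).
components = [
    Component("parse", "raw", "json", cpu=1),
    Component("convert", "json", "parquet", cpu=2),
]
example_plan = plan_pipeline(
    source=("A", "raw"),
    goal=("B", "parquet"),
    components=components,
    links=[("A", "B")],
    capacity={"A": 1, "B": 2},
)
print(example_plan)
```

In this toy instance the goal never mentions `parse` or `convert`; the search discovers that parsing must run on site A because site B only has enough capacity for the conversion step. A real numeric planner would handle far richer resource and network constraints than this exhaustive search does.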