SemPipes -- Optimizable Semantic Data Operators for Tabular Machine Learning Pipelines

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes SemPipes, a declarative programming model for tabular machine learning that targets the complexity, dependence on domain expertise, and engineering cost of traditional data preparation pipelines. Users specify data transformations in natural language through LLM-powered semantic data operators, and a runtime system handles execution. During training, the system synthesizes an implementation for each operator via LLM-based code generation and optimizes it with evolutionary search, so the data operations in a pipeline can be optimized automatically. Across diverse tabular ML tasks, the authors report that semantic operators substantially improve end-to-end predictive performance for both expert-designed and agent-generated pipelines while reducing pipeline complexity.

📝 Abstract
Real-world machine learning on tabular data relies on complex data preparation pipelines for prediction, data integration, augmentation, and debugging. Designing these pipelines requires substantial domain expertise and engineering effort, motivating the question of how large language models (LLMs) can support tabular ML through code synthesis. We introduce SemPipes, a novel declarative programming model that integrates LLM-powered semantic data operators into tabular ML pipelines. Semantic operators specify data transformations in natural language while delegating execution to a runtime system. During training, SemPipes synthesizes custom operator implementations based on data characteristics, operator instructions, and pipeline context. This design enables the automatic optimization of data operations in a pipeline via LLM-based code synthesis guided by evolutionary search. We evaluate SemPipes across diverse tabular ML tasks and show that semantic operators substantially improve end-to-end predictive performance for both expert-designed and agent-generated pipelines, while reducing pipeline complexity. We implement SemPipes in Python and release it at https://github.com/deem-data/sempipes/tree/v1.
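The idea of a semantic operator whose implementation is chosen during training can be illustrated with a toy sketch. This is not the SemPipes API: `SemanticOperator`, `propose_variants`, and `fit` are hypothetical names, the LLM call is replaced by a deterministic stub that mutates a pair of exponents, and Pearson correlation with the target stands in for end-to-end predictive performance. The loop is a small (mu + lambda)-style evolutionary search under those assumptions.

```python
import math
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple

Row = Dict[str, float]
Candidate = Tuple[int, int]  # exponents (p, q) encoding the feature a**p * b**q


@dataclass
class SemanticOperator:
    """Hypothetical operator: a natural-language instruction plus a fitted candidate."""
    instruction: str
    fitted: Candidate = (0, 0)

    def transform(self, rows: List[Row]) -> List[float]:
        p, q = self.fitted
        return [r["a"] ** p * r["b"] ** q for r in rows]


def propose_variants(instruction: str, parent: Candidate) -> List[Candidate]:
    """Stub for the LLM synthesis step (assumption): returns mutations of the parent.

    A real system would prompt an LLM with the instruction, data characteristics,
    and pipeline context; this stub ignores the instruction entirely.
    """
    p, q = parent
    return [(p + 1, q), (p - 1, q), (p, q + 1), (p, q - 1)]


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def fit(op: SemanticOperator, rows: List[Row], target: List[float],
        generations: int = 3, survivors: int = 2) -> float:
    """Evolve candidates: mutate the survivors, score everything, keep the best."""
    def score(candidate: Candidate) -> float:
        feature = SemanticOperator(op.instruction, candidate).transform(rows)
        return pearson(feature, target)

    population = [op.fitted]
    for _ in range(generations):
        pool = set(population)  # parents compete with their offspring
        for parent in population:
            pool.update(propose_variants(op.instruction, parent))
        population = sorted(pool, key=score, reverse=True)[:survivors]
    op.fitted = population[0]
    return score(op.fitted)


# Synthetic table in which the target is exactly the ratio a / b.
rng = random.Random(0)
rows = [{"a": rng.uniform(1, 5), "b": rng.uniform(1, 5)} for _ in range(200)]
target = [r["a"] / r["b"] for r in rows]

op = SemanticOperator("Derive a ratio feature that is predictive of the target")
best_score = fit(op, rows, target)
print(op.fitted, round(best_score, 3))
```

The user-facing part is only the instruction string; everything that decides how the feature is computed happens inside `fit`. In the system described above, the candidates would be synthesized code rather than exponent pairs, and the score would come from the trained pipeline.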

Problem

Research questions and friction points this paper is trying to address.

tabular machine learning
data preparation pipelines
large language models
semantic data operators
code synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic data operators
LLM-based code synthesis
tabular machine learning
declarative programming
evolutionary search