SemPipes -- Optimizable Semantic Data Operators for Tabular Machine Learning Pipelines

📅 2026-02-04

📈 Citations: 0

✨ Influential: 0

career value

172K/year

🤖 AI Summary

This work proposes a declarative programming model for tabular machine learning that addresses the complexity, expert dependency, and high engineering costs of traditional data preparation pipelines. By integrating large language model (LLM)-driven semantic data operators into the ML workflow, the system enables users to specify desired data transformations in natural language. It then automatically synthesizes and optimizes executable operators through LLM-based code generation guided by evolutionary search, constructing an end-to-end optimizable data processing pipeline. This approach uniquely combines natural language–specified semantic operators with evolution-guided code synthesis, significantly improving predictive performance across diverse tabular ML tasks while substantially reducing the complexity of both manual pipeline design and agent-generated solutions.

Technology Category

Application Category

📝 Abstract

Real-world machine learning on tabular data relies on complex data preparation pipelines for prediction, data integration, augmentation, and debugging. Designing these pipelines requires substantial domain expertise and engineering effort, motivating the question of how large language models (LLMs) can support tabular ML through code synthesis. We introduce SemPipes, a novel declarative programming model that integrates LLM-powered semantic data operators into tabular ML pipelines. Semantic operators specify data transformations in natural language while delegating execution to a runtime system. During training, SemPipes synthesizes custom operator implementations based on data characteristics, operator instructions, and pipeline context. This design enables the automatic optimization of data operations in a pipeline via LLM-based code synthesis guided by evolutionary search. We evaluate SemPipes across diverse tabular ML tasks and show that semantic operators substantially improve end-to-end predictive performance for both expert-designed and agent-generated pipelines, while reducing pipeline complexity. We implement SemPipes in Python and release it at https://github.com/deem-data/sempipes/tree/v1.

Problem

Research questions and friction points this paper is trying to address.

tabular machine learning

data preparation pipelines

large language models

semantic data operators

code synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic data operators

LLM-based code synthesis

tabular machine learning