AI Summary
Current large language models still struggle with natural language-driven data preparation, which involves ambiguous user intent, dirty real-world data, and the need to generate interpretable workflows. To address this gap, this work proposes PrepBench, the first systematic benchmark specifically designed for data preparation, which evaluates three core capabilities: interactive disambiguation, data preparation code generation, and translation of code into visual workflows. Built on real-world, multi-domain datasets, PrepBench features tasks that span 3 to 18 preparation steps, with the longest code solutions approaching 300 lines, filling a gap left by existing code generation benchmarks, which do not cover data preparation scenarios. Experimental results show that state-of-the-art models perform poorly on these tasks, and that PrepBench exposes key bottlenecks, offering a reliable evaluation standard for future research.
Abstract
Data preparation is a central and time-consuming stage in data analysis workflows. Traditionally, commercial tools have relied on graphical user interfaces (GUIs) to simplify data preparation, allowing users to define transformations through visual operators and workflows. Recent advances in large language models (LLMs) raise the possibility of a paradigm shift toward natural language (NL)-driven data preparation, in which users can specify preparation intents in NL directly. However, it remains unclear how far current LLM-based agents are from this paradigm shift in practice. Existing code generation benchmarks do not capture key characteristics of data preparation, including ambiguous user intents, imperfect real-world data, and the need to translate code into interpretable workflows for validation. To bridge this gap, we present PrepBench, a benchmark designed to evaluate NL-driven data preparation along three core capabilities: interactive disambiguation, prep-code generation, and code-to-workflow translation. We crawl data from the Preppin' Data Challenges, and then extend it into a systematically designed benchmark. The benchmark covers diverse domains, and each task involves 3 to 18 data preparation steps. Nearly half of the tasks require over 100 lines of Python code, and the longest solutions approach 300 lines. Our evaluation shows that, despite recent progress, realizing this paradigm shift remains challenging for state-of-the-art LLMs. PrepBench provides a principled benchmark for measuring this gap and helps identify key challenges toward realizing NL-driven data preparation.