Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs

📅 2026-01-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional data preparation methods face limitations in semantic understanding and generalization, struggling to meet the rapidly growing demand for application-ready data. This work systematically reviews the application of large language models (LLMs) in three core tasks—data cleaning, integration, and augmentation—and proposes a task-centered taxonomy that, for the first time, delineates the evolutionary trajectory of LLM-driven data preparation techniques. Through a comprehensive literature review, the study examines key technologies such as prompt engineering, agent-based architectures, and semantic matching, alongside prevailing datasets and evaluation metrics. It highlights LLMs’ strengths in enhancing generalization and semantic comprehension while identifying critical challenges related to computational cost, hallucination, scalability, and the lack of standardized evaluation frameworks. The paper concludes by outlining a roadmap for future research and development in this emerging field.

📝 Abstract
Data preparation aims to denoise raw datasets, uncover cross-dataset relationships, and extract valuable insights from them, which is essential for a wide range of data-centric applications. Driven by (i) rising demands for application-ready data (e.g., for analytics, visualization, decision-making), (ii) increasingly powerful LLM techniques, and (iii) the emergence of infrastructures that facilitate flexible agent construction (e.g., using Databricks Unity Catalog), LLM-enhanced methods are rapidly becoming a transformative and potentially dominant paradigm for data preparation. By investigating hundreds of recent literature works, this paper presents a systematic review of this evolving landscape, focusing on the use of LLM techniques to prepare data for diverse downstream tasks. First, we characterize the fundamental paradigm shift, from rule-based, model-specific pipelines to prompt-driven, context-aware, and agentic preparation workflows. Next, we introduce a task-centric taxonomy that organizes the field into three major tasks: data cleaning (e.g., standardization, error processing, imputation), data integration (e.g., entity matching, schema matching), and data enrichment (e.g., data annotation, profiling). For each task, we survey representative techniques, and highlight their respective strengths (e.g., improved generalization, semantic understanding) and limitations (e.g., the prohibitive cost of scaling LLMs, persistent hallucinations even in advanced agents, the mismatch between advanced methods and weak evaluation). Moreover, we analyze commonly used datasets and evaluation metrics (the empirical part). Finally, we discuss open research challenges and outline a forward-looking roadmap that emphasizes scalable LLM-data systems, principled designs for reliable agentic workflows, and robust evaluation protocols.
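To make the abstract's "prompt-driven" paradigm concrete, here is a minimal sketch of the entity-matching pattern such surveys cover: serialize two records into a natural-language prompt, send it to an LLM, and parse a yes/no verdict. The serialization format, the Yes/No answer protocol, and the stubbed model reply are illustrative assumptions, not a method taken from the paper; any chat-completion API could be plugged in where the stub sits.

```python
def serialize_record(record: dict) -> str:
    """Flatten a record into 'attr: value' pairs, a common serialization choice."""
    return ", ".join(f"{k}: {v}" for k, v in record.items())

def build_match_prompt(record_a: dict, record_b: dict) -> str:
    """Compose an entity-matching prompt over two serialized records."""
    return (
        "Do the following two records refer to the same real-world entity?\n"
        f"Record A -> {serialize_record(record_a)}\n"
        f"Record B -> {serialize_record(record_b)}\n"
        "Answer with a single word: Yes or No."
    )

def parse_match_answer(answer: str) -> bool:
    """Map the model's free-text reply onto a boolean match decision."""
    return answer.strip().lower().startswith("yes")

# Example with a stubbed model reply (no API call is made here):
prompt = build_match_prompt(
    {"name": "iPhone 15 Pro", "brand": "Apple"},
    {"name": "Apple iPhone15 Pro 256GB", "brand": "Apple"},
)
stub_reply = "Yes"  # what an LLM might plausibly return for this pair
print(parse_match_answer(stub_reply))  # True
```

The same prompt-then-parse skeleton generalizes to the other tasks in the taxonomy (e.g., asking the model to flag an erroneous cell or propose an imputed value), which is why the survey treats prompt engineering as a cross-cutting technique.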
Problem

Research questions and friction points this paper is trying to address.

data preparation
large language models
data cleaning
data integration
data enrichment
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-enhanced data preparation
agentic workflows
task-centric taxonomy
data cleaning
semantic understanding
Wei Zhou
Southwest Jiaotong University
Machine Learning, Data Mining
Jun Zhou
Shanghai Jiao Tong University, Shanghai, China
Haoyu Wang
University of Pennsylvania, Shanghai Jiao Tong University
Natural Language Processing, Computer Vision, Knowledge Graph
Zhenghao Li
Shanghai Jiao Tong University, Shanghai, China
Qikang He
Shanghai Jiao Tong University, Shanghai, China
Shaokun Han
Shanghai Jiao Tong University, Shanghai, China
Guoliang Li
Professor, Tsinghua University
Database, Big Data, Crowdsourcing, Data Cleaning & Integration
Xuanhe Zhou
Assistant Professor, Shanghai Jiao Tong University
Data Management, Artificial Intelligence
Yeye He
Microsoft Research
Data Management, Data Exploration, Data Preparation
Chunwei Liu
Massachusetts Institute of Technology
Databases, Compound AI Systems, LLM, Data Compression, IoT
Zirui Tang
Shanghai Jiao Tong University, Shanghai, China
Bin Wang
Pengcheng Laboratory
Cloud Computing, IIoT, Green Computing, Computer Architecture
Shen Tang
Xiaohongshu Inc.
Kai Zuo
Xiaohongshu Inc.
Yuyu Luo
Assistant Professor, HKUST(GZ) / HKUST
Data Agents, LLM Agents, Database, Text-to-SQL, Data-centric AI
Zhenzhe Zheng
Associate Professor, Department of Computer Science and Engineering, Shanghai Jiao Tong University
On-device Machine Learning, Game Theory, Decision-Making
Conghui He
Shanghai AI Laboratory
Data-centric AI, LLM, Document Intelligence
Jingren Zhou
Alibaba Group, Microsoft
Cloud Computing, Large Scale Distributed Systems, Machine Learning, Query Processing, Query
Fan Wu
Professor, Department of Computer Science and Engineering, Shanghai Jiao Tong University
Wireless Networking, Mobile Computing, Algorithmic Game Theory and Its Applications