Exploring LLM Agents for Cleaning Tabular Machine Learning Datasets

πŸ“… 2025-03-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the problem of labor-intensive, non-generalizable tabular data cleaning. We propose the first end-to-end autonomous agent framework powered by large language models (LLMs), which cleans data solely by generating and executing Python code, without altering downstream ML training pipelines or feature engineering. The agent combines intra-row contextual understanding with iterative feedback, enabling automatic detection and correction of single-row logical errors and outliers without predefined rules or domain knowledge. Evaluated on multiple Kaggle benchmark datasets with synthetically injected errors, the approach significantly improves the accuracy of downstream classification and regression models. However, it still struggles with cross-row distributional errors (e.g., global trend deviations). Our core contribution is the first LLM-driven, fully automated, executable, and feedback-enhanced tabular data cleaning paradigm.
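The generate-execute-feedback loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `llm_generate_cleaning_code` is a hypothetical stand-in for an actual LLM call, and the scoring function is a placeholder for downstream model accuracy.

```python
import pandas as pd

def llm_generate_cleaning_code(df_preview: str, feedback: str) -> str:
    """Stand-in for an LLM call: returns Python source that cleans `df`.
    Hard-coded here to repair implausible ages; a real agent would prompt
    a model with the data preview and feedback from earlier iterations."""
    return "df.loc[df['age'] < 0, 'age'] = df.loc[df['age'] >= 0, 'age'].median()\n"

def score(df: pd.DataFrame) -> float:
    # Placeholder for held-out model accuracy; here: fraction of valid ages.
    return float((df["age"] >= 0).mean())

def cleaning_loop(df: pd.DataFrame, max_iters: int = 3) -> pd.DataFrame:
    """Iteratively ask the 'LLM' for cleaning code, execute it, and keep
    the result only if the downstream score improves."""
    feedback = ""
    best = score(df)
    for _ in range(max_iters):
        code = llm_generate_cleaning_code(df.head().to_string(), feedback)
        candidate = df.copy()
        exec(code, {"df": candidate})  # the agent acts only by running code
        new_score = score(candidate)
        feedback = f"previous score {best:.2f}, new score {new_score:.2f}"
        if new_score > best:
            df, best = candidate, new_score
    return df
```

Keeping a candidate only when the downstream score improves mirrors the feedback mechanism the summary describes, while never touching the training pipeline itself.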

πŸ“ Abstract
High-quality, error-free datasets are a key ingredient in building reliable, accurate, and unbiased machine learning (ML) models. However, real-world datasets often suffer from errors due to sensor malfunctions, data entry mistakes, or improper data integration across multiple sources; these errors can severely degrade model performance. Detecting and correcting these issues typically requires tailor-made solutions and extensive domain expertise. Consequently, automation is challenging, rendering the process labor-intensive and tedious. In this study, we investigate whether Large Language Models (LLMs) can help alleviate the burden of manual data cleaning. We set up an experiment in which an LLM, paired with Python, is tasked with cleaning the training dataset to improve the performance of a learning algorithm, without the ability to modify the training pipeline or perform any feature engineering. We run this experiment on multiple Kaggle datasets that have been intentionally corrupted with errors. Our results show that LLMs can identify and correct erroneous entries, such as illogical values or outliers, by leveraging contextual information from other features within the same row, as well as feedback from previous iterations. However, they struggle to detect more complex errors that require understanding the data distribution across multiple rows, such as trends and biases.
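The experimental setup relies on intentionally corrupting clean Kaggle datasets before handing them to the agent. A minimal sketch of such error injection is shown below; the corruption scheme (a `-999` sentinel written into a random fraction of numeric cells) is an assumption for illustration, not the paper's exact protocol.

```python
import numpy as np
import pandas as pd

def inject_errors(df: pd.DataFrame, frac: float = 0.1, seed: int = 0) -> pd.DataFrame:
    """Corrupt a fraction of the numeric cells with illogical sentinel values,
    simulating sensor faults or data-entry mistakes."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.select_dtypes("number").columns:
        n_bad = max(1, int(frac * len(out)))
        idx = rng.choice(out.index, size=n_bad, replace=False)
        out.loc[idx, col] = -999  # clearly out-of-range, detectable from row context
    return out
```

Injecting known errors makes evaluation straightforward: the downstream model's accuracy on the corrupted, agent-cleaned, and original datasets can be compared directly.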
Problem

Research questions and friction points this paper is trying to address.

- Automating error detection in ML datasets using LLMs.
- Improving dataset quality without modifying training pipelines.
- Addressing limitations of LLMs in detecting complex data errors.
Innovation

Methods, ideas, or system contributions that make the work stand out.

- LLMs clean datasets using contextual row information.
- Python-integrated LLMs correct errors without pipeline changes.
- LLMs identify outliers but struggle with complex trends.
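The first innovation point, exploiting context from other features within the same row, can be made concrete with hand-written consistency rules. The rules below are purely illustrative assumptions (hypothetical `age` and `years_employed` columns); in the paper's framework the LLM infers such checks from the row context rather than having them hard-coded.

```python
import pandas as pd

def flag_intra_row_errors(df: pd.DataFrame) -> pd.Series:
    """Flag rows whose fields contradict each other, using only
    information available within the single row."""
    errors = pd.Series(False, index=df.index)
    # Illustrative rule: employment history cannot exceed working-age years.
    errors |= df["years_employed"] > df["age"] - 16
    # Illustrative rule: negative ages are illogical.
    errors |= df["age"] < 0
    return errors
```

Note that rules of this kind can never catch the cross-row errors the summary mentions (trends, biases), since each row is judged in isolation; that is exactly the failure mode the paper reports.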
πŸ”Ž Similar Papers
No similar papers found.