LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning

📅 2026-01-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of processing low-quality data in high-privacy domains such as healthcare, where manual curation is costly and risks privacy leakage. The authors propose a fully automated data-processing framework powered by large language model (LLM) agents that autonomously constructs and iteratively refines data-cleaning strategies without accessing the raw data. The framework combines distribution-preserving sampling, a binary classifier for identifying low-quality samples, iterative in-context learning, and a cache-and-reuse mechanism to preserve privacy, improve efficiency, and raise data quality simultaneously. Experimental results show that models trained on data processed by the framework achieve win rates above 80% against models trained on unprocessed data, win roughly 65% of comparisons against an LLM-agent-based AutoML baseline, and cut total search time by up to tenfold.
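The cache-and-reuse idea mentioned in the summary can be pictured as memoization over (strategy, sample) pairs, so that re-evaluating a candidate strategy skips samples already processed under it. The sketch below is an illustrative assumption, not the paper's implementation; the `ProcessingCache` class and its method names are hypothetical:

```python
import hashlib
import json


class ProcessingCache:
    """Memoize processing results keyed by (strategy id, sample).

    Illustrative sketch of a cache-and-reuse mechanism: repeated
    evaluations of the same strategy on the same sample return the
    stored result instead of recomputing it.
    """

    def __init__(self):
        self._store = {}

    def _key(self, strategy_id, sample):
        # Stable content hash over the strategy identifier and sample.
        payload = json.dumps([strategy_id, sample], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, strategy_id, sample, process):
        k = self._key(strategy_id, sample)
        if k not in self._store:
            self._store[k] = process(sample)  # only computed on a miss
        return self._store[k]
```

In a strategy-search loop, each candidate pipeline would pass its identifier and the sample through `get_or_compute`, so overlapping candidates share work across iterations.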

📝 Abstract
Large Language Models (LLMs) can be fine-tuned on domain-specific data to enhance their performance in specialized fields. However, such data often contains numerous low-quality samples, necessitating effective data processing (DP). In practice, DP strategies are typically developed through iterative manual analysis and trial-and-error adjustment. These processes inevitably incur high labor costs and may lead to privacy issues in high-privacy domains like healthcare due to direct human access to sensitive data. Thus, achieving automated data processing without exposing the raw data has become a critical challenge. To address this challenge, we propose LLM-AutoDP, a novel framework that leverages LLMs as agents to automatically generate and optimize data processing strategies. Our method generates multiple candidate strategies and iteratively refines them using feedback signals and comparative evaluations. This iterative in-context learning mechanism enables the agent to converge toward high-quality processing pipelines without requiring direct human intervention or access to the underlying data. To further accelerate strategy search, we introduce three key techniques: Distribution Preserving Sampling, which reduces data volume while maintaining distributional integrity; Processing Target Selection, which uses a binary classifier to identify low-quality samples for focused processing; and Cache-and-Reuse Mechanism, which minimizes redundant computations by reusing prior processing results. Results show that models trained on data processed by our framework achieve over 80% win rates against models trained on unprocessed data. Compared to AutoML baselines based on LLM agents, LLM-AutoDP achieves approximately a 65% win rate. Moreover, our acceleration techniques reduce the total search time by up to 10 times, demonstrating both effectiveness and efficiency.
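Distribution Preserving Sampling, as described in the abstract, reduces data volume while keeping the subset's distribution close to the full dataset's. A common way to achieve this is stratified subsampling: draw the same fraction from each stratum. The sketch below assumes strata are defined by a user-supplied labeling function; the function name and interface are illustrative, not taken from the paper:

```python
import random
from collections import defaultdict


def distribution_preserving_sample(samples, label_fn, rate, seed=0):
    """Stratified subsample: draw the same fraction `rate` from each
    stratum defined by `label_fn`, so the subset mirrors the full
    dataset's stratum proportions. Illustrative sketch only; the
    paper's actual sampling procedure may differ.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    strata = defaultdict(list)
    for s in samples:
        strata[label_fn(s)].append(s)
    subset = []
    for group in strata.values():
        # Keep at least one sample per stratum so no group vanishes.
        k = max(1, round(len(group) * rate))
        subset.extend(rng.sample(group, k))
    return subset
```

For example, sampling 10% of a dataset with an 80/20 label split yields a subset that preserves the 80/20 ratio, unlike uniform random sampling, which can skew small strata.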
Problem

Research questions and friction points this paper is trying to address.

automatic data processing
LLM agents
model fine-tuning
privacy-preserving
low-quality data
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM agents
automatic data processing
in-context learning
distribution preserving sampling
cache-and-reuse mechanism
Wei Huang
Google, Inc
Program Analysis · Type Inference · Web/Mobile Security
Anda Cheng
Ant Group, Beijing, China
Yinggui Wang
Ant Group, Beijing, China
Lei Wang
Ant Group, Beijing, China
Tao Wei
Vice President, Ant Financial
Software Engineering · System Security · Operating System · Programming Language