🤖 AI Summary
Facing projected data exhaustion in model training and the risks of using private, user-generated data, this paper proposes Generative Data Refinement (GDR), a framework that uses pretrained generative models to rewrite a dataset containing undesirable content into a refined dataset better suited for training. Because each synthetic output is conditioned on a real example, GDR's refined data naturally matches the diversity of web-scale datasets, sidestepping the difficult task of eliciting diverse synthetic data through prompting alone. Experiments show that GDR outperforms industry-grade solutions for dataset anonymization and enables direct detoxification of highly unsafe datasets. By enabling scalable, privacy-preserving data curation, GDR offers a practical path to expanding the total stock of training data for frontier models while meeting privacy and safety requirements.
📝 Abstract
For a fixed parameter size, the capabilities of large models are primarily determined by the quality and quantity of their training data. Consequently, training datasets now grow faster than the rate at which new data is indexed on the web, leading to projected data exhaustion over the next decade. Much more data exists as user-generated content that is not publicly indexed, but incorporating such data comes with considerable risks, such as leaking private information and other undesirable content. We introduce a framework, Generative Data Refinement (GDR), for using pretrained generative models to transform a dataset with undesirable content into a refined dataset that is more suitable for training. Our experiments show that GDR can outperform industry-grade solutions for dataset anonymization, as well as enable direct detoxification of highly unsafe datasets. Moreover, we show that by generating synthetic data that is conditioned on each example in the real dataset, GDR's refined outputs naturally match the diversity of web scale datasets, and thereby avoid the often challenging task of generating diverse synthetic data via model prompting. The simplicity and effectiveness of GDR make it a powerful tool for scaling up the total stock of training data for frontier models.
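The core idea — rewriting each real example with a generative model so the refined dataset keeps a one-to-one mapping with the original — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `redact_model` function is a hypothetical stand-in (a simple regex redactor) for a call to an actual pretrained generative model.

```python
import re
from typing import Callable, List

def redact_model(example: str) -> str:
    """Hypothetical stand-in for a pretrained generative model that
    rewrites text; here it just masks email addresses as a crude
    proxy for removing private content."""
    return re.sub(r"[\w.]+@[\w.]+", "[EMAIL]", example)

def gdr_refine(dataset: List[str], model: Callable[[str], str]) -> List[str]:
    """Refine a dataset GDR-style: condition the model on each real
    example, producing one refined output per input. The 1:1 mapping
    is what lets the refined data inherit the source's diversity."""
    return [model(example) for example in dataset]

raw = [
    "Contact alice@example.com for the logs.",
    "The reactor temperature peaked at 341 K.",
]
refined = gdr_refine(raw, redact_model)
```

Because refinement is applied per example rather than by open-ended prompting, the output dataset is the same size and covers the same topics as the input — only the undesirable content is rewritten.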