Ambient Dataloops: Generative Models for Dataset Refinement

📅 2026-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes an iterative framework that co-evolves a dataset and a diffusion model to address the performance degradation caused by heterogeneous sample quality in modern datasets. To avoid destructive self-consuming loops, each iteration treats the synthetically improved samples as noisy, at a slightly lower noise level than in the previous iteration, and trains on them with Ambient Diffusion techniques for learning under corruption. The authors report state-of-the-art results in unconditional and text-conditional image generation and in de novo protein design, together with a theoretical justification of the looping procedure.

📝 Abstract
We propose Ambient Dataloops, an iterative framework for refining datasets that makes it easier for diffusion models to learn the underlying data distribution. Modern datasets contain samples of highly varying quality, and training directly on such heterogeneous data often yields suboptimal models. We propose a dataset-model co-evolution process: at each iteration of our method, the dataset becomes progressively higher quality, and the model improves accordingly. To avoid destructive self-consuming loops, at each generation we treat the synthetically improved samples as noisy, but at a slightly lower noise level than in the previous iteration, and we use Ambient Diffusion techniques for learning under corruption. Empirically, Ambient Dataloops achieve state-of-the-art performance in unconditional and text-conditional image generation and de novo protein design. We further provide a theoretical justification for the proposed framework that captures the benefits of the data looping procedure.
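
The loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes 1-D data, replaces the diffusion model with a single Gaussian fitted under known per-sample noise levels, and uses hypothetical names throughout (`fit_prior`, `refine`, `dataloop`, `shrink`). It only shows the shape of the procedure: fit a model on noisy data, resample each noisy point at a lower noise level, and keep treating the result as noisy.

```python
import math
import random
import statistics


def fit_prior(samples, sigmas):
    """Toy noise-aware fit: estimate mean and variance of the clean data.

    Each sample is modeled as a clean value plus Gaussian noise with a known
    standard deviation, so the average noise variance is subtracted out.
    """
    mu = statistics.fmean(samples)
    observed_var = statistics.pvariance(samples, mu)
    noise_var = statistics.fmean([s * s for s in sigmas])
    return mu, max(observed_var - noise_var, 1e-6)


def refine(x, sigma, new_sigma, mu, var, rng):
    """Resample one point at the lower noise level new_sigma (Gaussian posterior)."""
    if sigma == 0.0:
        return x  # already clean samples are left untouched
    weight = (var + new_sigma ** 2) / (var + sigma ** 2)
    post_var = weight * (sigma ** 2 - new_sigma ** 2)
    return rng.gauss(mu + weight * (x - mu), math.sqrt(post_var))


def dataloop(samples, sigmas, rounds=3, shrink=0.8, rng=None):
    """Alternate between fitting the toy model and refining the dataset."""
    rng = rng or random.Random(0)
    for _ in range(rounds):
        mu, var = fit_prior(samples, sigmas)
        new_sigmas = [s * shrink for s in sigmas]
        samples = [
            refine(x, s, ns, mu, var, rng)
            for x, s, ns in zip(samples, sigmas, new_sigmas)
        ]
        # Refined samples are still labeled as noisy, only slightly less so.
        sigmas = new_sigmas
    return samples, sigmas
```

In this sketch, `shrink` controls how much the assumed noise level drops per round. Keeping it close to 1 mirrors the abstract's "slightly lower" noise level, so that no single round trusts the model's output as fully clean.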
Problem

Research questions and friction points this paper is trying to address.

dataset refinement
data quality
generative models
diffusion models
data heterogeneity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ambient Dataloops
dataset refinement
diffusion models
co-evolution
Ambient Diffusion
Adrián Rodríguez-Muñoz
Massachusetts Institute of Technology
William Daspit
The University of Texas at Austin
Adam Klivans
The University of Texas at Austin
Antonio Torralba
Professor of Computer Science, MIT
vision, computer vision
Constantinos Daskalakis
Professor of Computer Science, MIT
theoretical computer science, economics, probability theory, learning, statistics
G. Daras
Massachusetts Institute of Technology