Data Pruning in Generative Diffusion Models

📅 2024-11-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study challenges the “more data is better” intuition in generative modeling by systematically investigating the impact of data pruning on diffusion model performance and fairness. To address class skew arising from redundant and noisy samples in large-scale datasets, we propose a lightweight, unsupervised clustering–based pruning method that assesses sample importance without labels and explicitly optimizes for balanced class distribution. Our approach is the first to demonstrate that moderate pruning significantly improves diffusion model performance—reducing FID by 5.2% and increasing generation diversity by 18.7% on CelebA-HQ and ImageNet—while simultaneously mitigating generation bias toward long-tailed classes. Fairness metrics, such as class-wise FID variance, improve by up to 32.4%. Compared to state-of-the-art pruning strategies, our method is more computationally efficient, scalable, and uniquely achieves joint gains in both fidelity and fairness.

Technology Category

Application Category

📝 Abstract
Data pruning is the problem of identifying a core subset that is most beneficial to training and discarding the remainder. While pruning strategies are well studied for discriminative models like those used in classification, little research has gone into their application to generative models. Generative models aim to estimate the underlying distribution of the data, so presumably they should benefit from larger datasets. In this work we aim to shed light on the accuracy of this statement, specifically answer the question of whether data pruning for generative diffusion models could have a positive impact. Contrary to intuition, we show that eliminating redundant or noisy data in large datasets is beneficial particularly when done strategically. We experiment with several pruning methods including recent-state-of-art methods, and evaluate over CelebA-HQ and ImageNet datasets. We demonstrate that a simple clustering method outperforms other sophisticated and computationally demanding methods. We further exhibit how we can leverage clustering to balance skewed datasets in an unsupervised manner to allow fair sampling for underrepresented populations in the data distribution, which is a crucial problem in generative models.
Problem

Research questions and friction points this paper is trying to address.

Data Pruning
Imbalanced Data
Diffusion Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data Pruning
Diffusion Models
Fairness in AI
🔎 Similar Papers
No similar papers found.
R
Rania Briq
Juelich Supercomputing Centre
Jiangtao Wang
Jiangtao Wang
Coventry University, United Kingdom
AI for HealthCrowd SensingUbiquitous ComputingDigital Health
S
Steffan Kesselheim
Juelich Supercomputing Centre