Bigger Isn't Always Memorizing: Early Stopping Overparameterized Diffusion Models

📅 2025-05-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the competition between generalization and memorization in highly overparameterized diffusion models. Method: generalization is observed to emerge progressively during early training, while memorization dominates only later; the authors therefore propose a dataset-size-scaled early-stopping criterion to balance the two phenomena. Theoretically, the paper frames generalization versus memorization as a competition between time scales, establishing an empirical linear scaling law in which the onset of memorization dominance grows proportionally with dataset size, and summarizes the dynamics in an interpretable phase diagram. Experiments cover both image and language diffusion models, synthetic tasks based on probabilistic context-free grammars with random rules, and empirical validation of the early-stopping strategy. Results: the data-dependent early-stopping rule significantly suppresses memorization, improves generalization, enhances privacy preservation, and boosts hyperparameter transferability across datasets and architectures.
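As a concrete illustration of how such a dataset-size-scaled criterion could be wired into a training loop, here is a minimal Python sketch. The calibration constant C_MEM, the function names, and the safety margin are hypothetical placeholders for exposition, not the paper's implementation.

```python
# Minimal sketch of a dataset-size-scaled early-stopping rule, assuming the
# paper's empirical law that the onset of memorization grows linearly with
# the dataset size. C_MEM is a hypothetical calibration constant: in practice
# it would be estimated on a reference dataset where the memorization onset
# can be measured directly.

C_MEM = 50.0  # hypothetical: memorization-onset steps per training example


def memorization_onset(num_examples: int, c_mem: float = C_MEM) -> int:
    """Estimated training step at which memorization starts to dominate,
    following the linear scaling (onset time proportional to dataset size)."""
    return int(c_mem * num_examples)


def should_stop(step: int, num_examples: int, safety_margin: float = 0.5) -> bool:
    """Stop once the current step reaches a fraction of the estimated
    memorization onset, trading a small generalization cost for avoiding
    memorization of individual training examples."""
    return step >= safety_margin * memorization_onset(num_examples)


if __name__ == "__main__":
    n = 10_000  # training-set size
    for step in (1_000, 200_000, 400_000):
        print(step, should_stop(step, n))
```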

📝 Abstract
Diffusion probabilistic models have become a cornerstone of modern generative AI, yet the mechanisms underlying their generalization remain poorly understood. In fact, if these models were perfectly minimizing their training loss, they would just generate data belonging to their training set, i.e., memorize, as empirically found in the overparameterized regime. We revisit this view by showing that, in highly overparameterized diffusion models, generalization in natural data domains is progressively achieved during training before the onset of memorization. Our results, ranging from image to language diffusion models, systematically support the empirical law that memorization time is proportional to the dataset size. Generalization vs. memorization is then best understood as a competition between time scales. We show that this phenomenology is recovered in diffusion models learning a simple probabilistic context-free grammar with random rules, where generalization corresponds to the hierarchical acquisition of deeper grammar rules as training time grows, and the generalization cost of early stopping can be characterized. We summarize these results in a phase diagram. Overall, our results support that a principled early-stopping criterion - scaling with dataset size - can effectively optimize generalization while avoiding memorization, with direct implications for hyperparameter transfer and privacy-sensitive applications.
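Stated as a formula, the empirical law described in the abstract reads as follows; the constant c and the precise operational definition of the memorization time are illustrative placeholders, not values taken from the paper.

```latex
% Empirical law from the abstract: the training time at which memorization
% begins to dominate grows linearly with the training-set size n.
\tau_{\mathrm{mem}}(n) \propto n
% A dataset-size-scaled early-stopping rule then chooses a stopping time
% below this onset; c is illustrative and would need to be calibrated.
t_{\mathrm{stop}}(n) = c\, n, \qquad t_{\mathrm{stop}}(n) < \tau_{\mathrm{mem}}(n)
```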
Problem

Research questions and friction points this paper is trying to address.

Understanding generalization vs memorization in overparameterized diffusion models
Identifying early-stopping criteria to optimize generalization and avoid memorization
Characterizing the time-scale competition between generalization and memorization across image and language diffusion models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Early stopping prevents memorization in diffusion models
Memorization onset time scales linearly with dataset size
Phase diagram summarizes training dynamics
Authors
Alessandro Favero
EPFL, Lausanne, Switzerland
Antonio Sclocchi
Gatsby Unit, UCL, London, UK
Matthieu Wyart
Professor of Physics, Johns Hopkins University