On the Edge of Memorization in Diffusion Models

📅 2025-08-25
🤖 AI Summary
The boundary between memorization (reproducing training data) and generalization (generating novel samples) in diffusion models remains poorly understood, yet it critically determines copyright and privacy risks. Method: We establish a theoretical framework, developed first for underparameterized diffusion models, in which the memorization–generalization phase transition is governed by the relative weighting of the memorizing and generalizing components of the training loss. We derive an analytical criterion that precisely predicts the critical model size at which memorization begins to dominate, and design a "mathematical laboratory" of synthetic and structured image data to track the evolution of the loss-component weights through both theoretical analysis and gradient descent experiments. Contribution/Results: Experiments confirm the accuracy of the theoretical predictions, yielding explicit model-capacity thresholds beyond which memorization becomes likely. This work provides the first analytically tractable and empirically verifiable theoretical foundation for designing safe, controllable diffusion models.

📝 Abstract
When do diffusion models reproduce their training data, and when are they able to generate samples beyond it? A practically relevant theoretical understanding of this interplay between memorization and generalization may significantly impact real-world deployments of diffusion models with respect to issues such as copyright infringement and data privacy. In this work, to disentangle the different factors that influence memorization and generalization in practical diffusion models, we introduce a scientific and mathematical "laboratory" for investigating these phenomena in diffusion models trained on fully synthetic or natural image-like structured data. Within this setting, we hypothesize that the memorization or generalization behavior of an underparameterized trained model is determined by the difference in training loss between an associated memorizing model and a generalizing model. To probe this hypothesis, we theoretically characterize a crossover point wherein the weighted training loss of a fully generalizing model becomes greater than that of an underparameterized memorizing model at a critical value of model (under)parameterization. We then demonstrate via carefully designed experiments that the location of this crossover predicts a phase transition in diffusion models trained via gradient descent, validating our hypothesis. Ultimately, our theory enables us to analytically predict the model size at which memorization becomes predominant. Our work provides an analytically tractable and practically meaningful setting for future theoretical and empirical investigations. Code for our experiments is available at https://github.com/DruvPai/diffusion_mem_gen.
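The core hypothesis above can be sketched numerically: if memorization sets in once the achievable training loss of a memorizing solution drops below that of a generalizing one, the predicted critical model size is simply the crossover of the two loss curves. The toy sketch below is an assumption-laden illustration, not the paper's actual loss functions: `loss_generalizing` and `loss_memorizing` are hypothetical functional forms chosen only to produce a crossover, with a flat generalizing loss and a memorizing loss that shrinks as capacity `p` grows relative to the training-set size.

```python
import numpy as np

# Hypothetical loss curves (NOT the paper's): chosen only so that a
# crossover in model size p exists and can be located numerically.

def loss_generalizing(p):
    # Assumed form: the generalizing solution's loss is roughly
    # independent of model size in this toy setting.
    return np.full_like(p, 1.0, dtype=float)

def loss_memorizing(p, n_train=100):
    # Assumed form: memorizing an n_train-point dataset gets cheaper
    # as capacity grows relative to the dataset size.
    return n_train / p

def critical_size(p_grid, n_train=100):
    """Smallest model size at which memorizing is the lower-loss solution."""
    gap = loss_memorizing(p_grid, n_train) - loss_generalizing(p_grid)
    idx = np.argmax(gap < 0)  # index of first True (first negative gap)
    return p_grid[idx]

p = np.arange(1, 1001, dtype=float)
p_star = critical_size(p, n_train=100)
print(p_star)  # memorization predicted to dominate for p > p_star
```

Under these toy forms the crossover lands just past `p = n_train`, mirroring the paper's qualitative claim that the phase transition location can be read off analytically from the two loss components.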
Problem

Research questions and friction points this paper is trying to address.

Investigates memorization vs generalization in diffusion models
Examines factors influencing training data reproduction
Predicts critical model size for memorization predominance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces synthetic data laboratory for studying memorization
Theoretically characterizes crossover point for memorization behavior
Analytically predicts model size where memorization becomes predominant