🤖 AI Summary
This paper addresses the central copyright controversy over whether generative AI "memorizes" training data, such as *New York Times* articles. To resolve the conceptual ambiguity and technical imprecision of "memorization" in legal discourse, the study introduces, for the first time, a rigorous four-criteria definition: (i) the training data can be reconstructed from the model, (ii) the reconstruction is a near-exact copy, (iii) it covers a substantial portion of the original work, and (iv) it originates causally in the training process, not in inference-time behavior. Drawing on machine learning theory, empirical analysis, and copyright doctrine, the paper distinguishes memorization from extraction, regurgitation, and reconstruction, and argues that a model that has memorized training data constitutes a "copy" under copyright law. The resulting framework supports operationally grounded liability assessment, clarifies attribution logic, and informs both judicial reasoning and compliant model-training practice.
📝 Abstract
The New York Times's copyright lawsuit against OpenAI and Microsoft alleges that OpenAI's GPT models have "memorized" NYT articles. Other lawsuits make similar claims. But parties, courts, and scholars disagree on what memorization is, whether it is taking place, and what its copyright implications are. These debates are clouded by ambiguities over the nature of "memorization." We attempt to bring clarity to the conversation. We draw on the technical literature to provide a firm foundation for legal discussions, providing a precise definition of memorization: a model has "memorized" a piece of training data when (1) it is possible to reconstruct from the model (2) a near-exact copy of (3) a substantial portion of (4) that piece of training data. We distinguish memorization from "extraction" (a user intentionally causes a model to generate a near-exact copy), from "regurgitation" (a model generates a near-exact copy, regardless of user intentions), and from "reconstruction" (a near-exact copy can be obtained from the model by any means). Several consequences follow. (1) Not all learning is memorization. (2) Memorization occurs when a model is trained; regurgitation is a symptom of it, not its cause. (3) A model that has memorized training data is a "copy" of that training data in the sense used by copyright. (4) A model is not like a VCR or other general-purpose copying technology; it is better at generating some types of outputs (possibly regurgitated ones) than others. (5) Memorization is not a phenomenon caused by "adversarial" users bent on extraction; it is latent in the model itself. (6) The amount of training data a model memorizes is a consequence of choices made in training. (7) Whether a model that has memorized training data actually regurgitates it depends on overall system design. In a very real sense, memorized training data is in the model; to quote Zoolander, the files are in the computer.