Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon

📅 2024-06-25
🏛️ arXiv.org
📈 Citations: 12
Influential: 2
🤖 AI Summary
Memorization in language models is often oversimplified as a homogeneous phenomenon, neglecting sample-specific characteristics and heterogeneous model–corpus interactions. Method: We propose the *memorization heterogeneity hypothesis*, decomposing memorization into three distinct types: *recitation* (highly duplicated sequences), *reconstruction* (inherently predictable sequences), and *recollection* (low-duplication, low-predictability sequences). Leveraging causally inspired feature engineering, we develop a category-aware, multi-factor logistic regression model for interpretable, cross-category attribution. Contribution/Results: This work establishes the first memorization taxonomy driven jointly by sample attributes and model–corpus co-adaptation. We identify distinct dominant factors per type (e.g., repetition rate for recitation, local entropy for recollection) and achieve an AUC of 0.89, significantly outperforming homogeneous baselines. Our framework enables a fine-grained, mechanistic understanding of memorization behavior in large language models.
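The category-aware, multi-factor logistic regression described in the summary can be sketched as below; the feature names, thresholds, and synthetic data are illustrative assumptions, not the paper's actual features or implementation.

```python
# Sketch of a category-aware, multi-factor logistic regression for
# predicting memorization. Feature names, thresholds, and the
# synthetic data are hypothetical illustrations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 300

# Hypothetical per-sample factors: duplicate count in the corpus,
# local token entropy, and perplexity under the model.
X = np.column_stack([
    rng.poisson(5, n).astype(float),  # duplicate_count
    rng.uniform(0, 8, n),             # local_entropy
    rng.uniform(1, 50, n),            # perplexity
])

# Synthetic labels following the rough intuition that duplicated or
# highly predictable (low-entropy) samples are memorized more often.
logit = 0.6 * X[:, 0] - 0.7 * X[:, 1] - 0.05 * X[:, 2]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

def categorize(x):
    """Illustrative taxonomy rule: recitation for heavy duplication,
    reconstruction for highly predictable text, recollection otherwise."""
    if x[0] > 7:
        return "recitation"
    if x[1] < 2:
        return "reconstruction"
    return "recollection"

categories = np.array([categorize(x) for x in X])

# Fit one interpretable model per category so the learned weights
# (dominant factors) can be compared across categories.
features = ["duplicate_count", "local_entropy", "perplexity"]
models = {}
for cat in ("recitation", "reconstruction", "recollection"):
    mask = categories == cat
    if len(set(y[mask])) > 1:  # need both classes present to fit
        m = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
        models[cat] = dict(zip(features, m.coef_[0]))

for cat, w in models.items():
    print(cat, {k: round(v, 2) for k, v in w.items()})
```

Fitting a separate model per category is what makes the attribution "category-aware": each category's coefficients can then be inspected independently, rather than averaging factor effects across heterogeneous samples.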

📝 Abstract
Memorization in language models is typically treated as a homogeneous phenomenon, neglecting the specifics of the memorized data. We instead model memorization as the effect of a set of complex factors that describe each sample and relate it to the model and corpus. To build intuition around these factors, we break memorization down into a taxonomy: recitation of highly duplicated sequences, reconstruction of inherently predictable sequences, and recollection of sequences that are neither. We demonstrate the usefulness of our taxonomy by using it to construct a predictive model for memorization. By analyzing dependencies and inspecting the weights of the predictive model, we find that different factors influence the likelihood of memorization differently depending on the taxonomic category.
Problem

Research questions and friction points this paper is trying to address.

Model memorization as multifaceted, not homogeneous
Classify memorization into recitation, reconstruction, recollection
Predict memorization using taxonomic category factors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model memorization via multi-factor analysis
Taxonomy: recitation, reconstruction, recollection
Predictive model for memorization likelihood
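The three-way taxonomy listed above can be expressed as a simple decision rule; the thresholds and feature names here are illustrative assumptions, not values from the paper.

```python
# Illustrative decision rule for the recite/reconstruct/recollect
# taxonomy. The duplication threshold and the predictability flag
# are hypothetical stand-ins for the paper's corpus statistics.
def taxonomic_category(duplicate_count: int, is_predictable: bool) -> str:
    """Assign a sequence to one of the three memorization categories."""
    if duplicate_count > 5:      # highly duplicated in the training corpus
        return "recitation"
    if is_predictable:           # e.g. templated or incrementing text
        return "reconstruction"
    return "recollection"        # neither duplicated nor predictable

print(taxonomic_category(100, False))  # recitation
print(taxonomic_category(1, True))     # reconstruction
print(taxonomic_category(1, False))    # recollection
```

The order of the checks matters: a sequence that is both heavily duplicated and predictable falls into recitation first, so the categories partition the data.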
👥 Authors
USVSN Sai Prashanth (EleutherAI, Microsoft)
Alvin Deng (EleutherAI, DatologyAI)
Kyle O'Brien (EleutherAI, New York University)
V JyothirS (EleutherAI)
Mohammad Aflah Khan (MPI-SWS, EleutherAI)
Jaydeep Borkar (Northeastern University)
Christopher A. Choquette-Choo (OpenAI)
Jacob Ray Fuehne (University of Illinois at Urbana-Champaign)
Stella Biderman (EleutherAI)
Tracy Ke (Harvard University)
Katherine Lee (OpenAI)
Naomi Saphra (Harvard University, Kempner Institute)