🤖 AI Summary
Language model memorization is often treated as a homogeneous phenomenon, neglecting sample-specific characteristics and heterogeneous model–corpus interactions.
Method: We propose the *memorization heterogeneity hypothesis*, decomposing memorization into three distinct types: *recitation* (highly duplicated sequences), *reconstruction* (inherently predictable sequences), and *recollection* (sequences that are neither). Using causally inspired feature engineering, we develop a category-aware, multi-factor logistic regression model for interpretable, cross-category attribution.
Contribution/Results: This work establishes the first memorization taxonomy driven jointly by sample attributes and model–corpus co-adaptation. We identify distinct dominant factors per type (e.g., duplication rate for recitation, local entropy for recollection) and achieve an AUC of 0.89, significantly outperforming homogeneous baselines. Our framework enables a fine-grained, mechanistic understanding of memorization behavior in large language models.
📝 Abstract
Memorization in language models is typically treated as a homogeneous phenomenon, neglecting the specifics of the memorized data. We instead model memorization as the effect of a set of complex factors that describe each sample and relate it to the model and corpus. To build intuition around these factors, we break memorization down into a taxonomy: recitation of highly duplicated sequences, reconstruction of inherently predictable sequences, and recollection of sequences that are neither. We demonstrate the usefulness of our taxonomy by using it to construct a predictive model for memorization. By analyzing dependencies and inspecting the weights of the predictive model, we find that different factors influence the likelihood of memorization differently depending on the taxonomic category.
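To make the idea of a category-aware predictive model concrete, here is a minimal illustrative sketch, not the authors' code: it fits one logistic regression per taxonomic category on synthetic data, using hypothetical per-sample features (a corpus duplication count and a predictability score), so that the per-category weights reveal which factor dominates within each category.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 600

# Hypothetical per-sample features (illustrative names, not from the paper):
duplicates = rng.poisson(3.0, n).astype(float)   # times the sequence appears in the corpus
predictability = rng.normal(0.0, 1.0, n)         # e.g. mean token log-likelihood score

# Assign each sample to a taxonomic category, mirroring the taxonomy:
# recitation = highly duplicated, reconstruction = highly predictable,
# recollection = neither.
category = np.where(duplicates > 5, "recitation",
           np.where(predictability > 0.5, "reconstruction", "recollection"))

# Synthetic memorization labels: a different factor dominates each category,
# which is exactly the structure the category-aware model should recover.
logits = np.select(
    [category == "recitation", category == "reconstruction"],
    [0.4 * duplicates - 2.0, 1.5 * predictability],
    default=0.3 * predictability + 0.1 * duplicates - 1.0,
)
p_true = np.clip(1.0 / (1.0 + np.exp(-logits)), 0.1, 0.9)
memorized = (rng.random(n) < p_true).astype(float)

def fit_logistic(X, y, lr=0.1, steps=3000):
    """Plain gradient-descent logistic regression: returns [bias, w_1, ..., w_k]."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        pred = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (pred - y) / len(y)
    return w

# Category-aware model: one regression per category, so the learned weights
# can differ across categories instead of being averaged into one global fit.
X = np.column_stack([duplicates, predictability])
models = {}
for cat in np.unique(category):
    mask = category == cat
    models[cat] = fit_logistic(X[mask], memorized[mask])
    _, w_dup, w_pred = models[cat]
    print(f"{cat:14s} w_duplicates={w_dup:+.2f}  w_predictability={w_pred:+.2f}")
```

Inspecting the per-category weights then plays the same role as inspecting the predictive model's weights in the abstract: each category's fit exposes its own dominant factor rather than a single pooled coefficient.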