Guess or Recall? Training CNNs to Classify and Localize Memorization in LLMs

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
The mechanistic underpinnings of verbatim memorization in large language models (LLMs) remain poorly understood, particularly its internal diversity and its basis in attention. Method: the authors train lightweight CNNs to classify multi-layer attention maps, combined with a custom visualization technique that localizes the attention regions associated with each form of memorization. Contribution/Results: the existing binary taxonomy (recall vs. generation) aligns poorly with the attention mechanisms involved in decoding. The paper therefore proposes a three-way taxonomy (guessed, recalled, and non-memorized samples), establishing "guessing" as an independent category. Empirical validation shows that a significant proportion of samples previously treated as recalled are in fact guessed via general language-modeling ability rather than retrieved from stored training data. The new taxonomy improves attention-behavior alignment by +32.7%, offering a new methodological approach for memorization modeling and interpretability research in LLMs.

📝 Abstract
Verbatim memorization in Large Language Models (LLMs) is a multifaceted phenomenon involving distinct underlying mechanisms. We introduce a novel method to analyze the different forms of memorization described by the existing taxonomy. Specifically, we train Convolutional Neural Networks (CNNs) on the attention weights of the LLM and evaluate the alignment between this taxonomy and the attention weights involved in decoding. We find that the existing taxonomy performs poorly and fails to reflect distinct mechanisms within the attention blocks. We propose a new taxonomy that maximizes alignment with the attention weights, consisting of three categories: memorized samples that are guessed using language modeling abilities, memorized samples that are recalled due to high duplication in the training set, and non-memorized samples. Our results reveal that few-shot verbatim memorization does not correspond to a distinct attention mechanism. We also show that a significant proportion of extractable samples are in fact guessed by the model and should therefore be studied separately. Finally, we develop a custom visual interpretability technique to localize the regions of the attention weights involved in each form of memorization.
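As a rough illustration of the classification setup described in the abstract, the sketch below runs a tiny untrained CNN forward pass over stacked attention maps and produces probabilities for the three proposed categories. All dimensions, layer choices, and parameter values are hypothetical stand-ins, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 4 layers x 2 heads of SEQ x SEQ attention maps.
LAYERS, HEADS, SEQ = 4, 2, 16
N_CLASSES = 3  # guessed / recalled / non-memorized

def conv2d(x, w):
    """Valid 2D convolution: x (C, H, W), w (F, C, k, k) -> (F, H-k+1, W-k+1)."""
    F, C, k, _ = w.shape
    H, W = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.zeros((F, H, W))
    for f in range(F):
        for i in range(H):
            for j in range(W):
                out[f, i, j] = np.sum(x[:, i:i + k, j:j + k] * w[f])
    return out

def classify_attention(att, w_conv, w_fc, b_fc):
    """Forward pass: conv -> ReLU -> global average pool -> linear -> softmax."""
    x = att.reshape(LAYERS * HEADS, SEQ, SEQ)  # stack layers/heads as channels
    h = np.maximum(conv2d(x, w_conv), 0.0)     # ReLU
    pooled = h.mean(axis=(1, 2))               # global average pooling
    logits = pooled @ w_fc + b_fc
    e = np.exp(logits - logits.max())          # numerically stable softmax
    return e / e.sum()

# Randomly initialized (untrained) parameters, purely illustrative.
w_conv = rng.normal(scale=0.1, size=(8, LAYERS * HEADS, 3, 3))
w_fc = rng.normal(scale=0.1, size=(8, N_CLASSES))
b_fc = np.zeros(N_CLASSES)

att = rng.uniform(size=(LAYERS, HEADS, SEQ, SEQ))  # stand-in attention weights
probs = classify_attention(att, w_conv, w_fc, b_fc)
```

In the paper's setting the CNN would instead be trained on attention weights extracted while the LLM decodes memorized and non-memorized samples; the sketch only shows the input/output shape of such a classifier.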
Problem

Research questions and friction points this paper is trying to address.

Analyze distinct forms of memorization in LLMs using attention weights
Evaluate alignment between existing taxonomy and attention mechanisms
Develop new taxonomy to classify memorized and non-memorized samples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Train CNNs on LLM attention weights
Propose new memorization taxonomy for alignment
Develop visual interpretability for memorization localization
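The paper's visual interpretability technique is custom and not detailed in this listing. A generic stand-in for this kind of localization is occlusion sensitivity: zero out patches of an attention map and record the drop in a classifier score, yielding a coarse heat map of the regions that matter. The trace-based score function below is a toy choice for illustration only.

```python
import numpy as np

def occlusion_map(att, score_fn, patch=4):
    """Zero out patch x patch regions of a 2D attention map and record
    the drop in a scalar score -> a coarse localization heat map."""
    H, W = att.shape
    base = score_fn(att)
    heat = np.zeros((H // patch, W // patch))
    for i in range(0, H - patch + 1, patch):
        for j in range(0, W - patch + 1, patch):
            occluded = att.copy()
            occluded[i:i + patch, j:j + patch] = 0.0
            heat[i // patch, j // patch] = base - score_fn(occluded)
    return heat

# Toy example: the score is the total mass on the diagonal, so occluding
# diagonal patches produces the largest drops in the heat map.
att = np.eye(16) * 0.5 + 0.01
heat = occlusion_map(att, lambda a: float(np.trace(a)))
```

In practice `score_fn` would be the trained classifier's probability for a given memorization category, so the heat map highlights which attention regions support that classification.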
Jérémie Dentan
LIX (École Polytechnique, IP Paris, CNRS)
Davide Buscaldi
Associate Professor (Maître de conférences, HDR), LIPN, Université Sorbonne Paris Nord
LLMs · Information Retrieval · Ontology Learning · Geographic IR · Text Mining
Sonia Vanier
LIX (École Polytechnique, IP Paris, CNRS)