Rethinking Memorization Measures and their Implications in Large Language Models

📅 2025-07-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether memorization is inherently unavoidable in optimal language learning by large language models (LLMs) and whether their real-world privacy risks are overestimated. We propose "contextual memorization", a novel, fine-grained measure that distinguishes genuine data memorization from benign use of context, thereby correcting false positives produced by prior recollection-based memorization detection. Theoretically, we show this measure imposes stricter conditions than counterfactual memorization, clarifying fundamental distinctions among the three memorization paradigms. Extensive experiments across 18 LLMs from 6 families and formal languages of varying entropy reveal: (i) optimal learning cannot fully eliminate memorization; (ii) improved learning capacity reduces contextual and counterfactual memorization but increases recollection-based memorization; and (iii) most strings previously reported as memorized pose negligible privacy risk and are neither contextually nor counterfactually memorized. Our core contributions are: a refined conceptualization of memorization, a more discriminative evaluation framework, and a rigorous re-assessment of privacy implications.

📝 Abstract
Owing to privacy concerns, memorization in LLMs is often seen as undesirable, specifically for learning. In this paper, we study whether memorization can be avoided when optimally learning a language, and whether the privacy threat posed by memorization is exaggerated. To this end, we re-examine existing privacy-focused measures of memorization, namely recollection-based and counterfactual memorization, alongside a newly proposed contextual memorization. Relating memorization to local over-fitting during learning, contextual memorization aims to disentangle memorization from the contextual learning ability of LLMs. Informally, a string is contextually memorized if its recollection due to training exceeds the optimal contextual recollection, a learned threshold denoting the best contextual learning achievable without training. Conceptually, contextual memorization avoids the fallacy of recollection-based memorization, under which any form of high recollection is taken as a sign of memorization. Theoretically, contextual memorization relates to counterfactual memorization but imposes stronger conditions. The memorization measures also differ in their outcomes and information requirements. Experimenting on 18 LLMs from 6 families and multiple formal languages of different entropy, we show that (a) memorization measures disagree on the memorization order of strings of varying frequency, (b) optimal learning of a language cannot avoid partial memorization of training strings, and (c) improved learning decreases contextual and counterfactual memorization but increases recollection-based memorization. Finally, (d) we revisit existing reports of strings memorized by recollection that neither pose a privacy threat nor are contextually or counterfactually memorized.
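To make the informal definition above concrete, here is a schematic formalization; the notation (R_trained, R_ctx*) is ours and is not taken from the paper.

```latex
% Schematic formalization of contextual memorization (notation assumed, not the paper's).
% R_trained(s): recollection of string s by the model after training.
% R_ctx*(s):    optimal contextual recollection of s, the best recollection
%               achievable from context alone, without training on s.
\[
  \mathrm{mem}_{\mathrm{ctx}}(s) \;=\; \max\!\bigl(0,\; R_{\mathrm{trained}}(s) - R^{*}_{\mathrm{ctx}}(s)\bigr),
  \qquad
  s \ \text{is contextually memorized} \iff \mathrm{mem}_{\mathrm{ctx}}(s) > 0.
\]
```

Under this reading, a string whose high recollection is fully explained by context, i.e. R_trained(s) <= R_ctx*(s), is never flagged; this is precisely the false positive that purely recollection-based measures incur.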
Problem

Research questions and friction points this paper is trying to address.

Examining whether memorization can be avoided in optimal language learning
Assessing privacy threats posed by memorization in large language models
Comparing memorization measures and their outcomes in diverse LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces contextual memorization for LLMs
Re-examines existing memorization measures critically
Empirically tests memorization across 18 LLMs from 6 families