Rethinking Memorization Measures and their Implications in Large Language Models

📅 2025-07-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether memorization is inherently unavoidable in optimal language learning by large language models (LLMs) and whether their real-world privacy risks are overestimated. We propose "contextual memorization", a novel, fine-grained measure that distinguishes genuine data memorization from benign use of context, thereby correcting false positives produced by prior recollection-based memorization detection. Theoretically, we show this measure imposes stricter conditions than counterfactual memorization, clarifying fundamental distinctions among the three memorization paradigms. Extensive experiments across 18 LLMs from 6 families and formal languages of varying entropy reveal: (i) optimal learning cannot fully eliminate memorization; (ii) improved learning capacity reduces contextual and counterfactual memorization but increases recollection-based memorization; and (iii) most strings previously reported as memorized pose negligible privacy risk and are neither contextually nor counterfactually memorized. Our core contributions are: a refined conceptualization of memorization, a more discriminative evaluation framework, and a rigorous re-assessment of privacy implications.

📝 Abstract
Owing to privacy concerns, memorization in LLMs is often seen as undesirable, specifically for learning. In this paper, we study whether memorization can be avoided when optimally learning a language, and whether the privacy threat posed by memorization is exaggerated. To this end, we re-examine existing privacy-focused measures of memorization, namely recollection-based and counterfactual memorization, alongside a newly proposed contextual memorization. Relating memorization to local over-fitting during learning, contextual memorization aims to disentangle memorization from the contextual learning ability of LLMs. Informally, a string is contextually memorized if its recollection due to training exceeds the optimal contextual recollection, a learned threshold denoting the best contextual learning achievable without training. Conceptually, contextual memorization avoids the fallacy of recollection-based memorization, under which any form of high recollection is taken as a sign of memorization. Theoretically, contextual memorization relates to counterfactual memorization but imposes stronger conditions. The memorization measures also differ in their outcomes and information requirements. Experimenting on 18 LLMs from 6 families and multiple formal languages of different entropy, we show that (a) memorization measures disagree on the memorization order of strings of varying frequency, (b) optimal learning of a language cannot avoid partial memorization of training strings, and (c) improved learning decreases contextual and counterfactual memorization but increases recollection-based memorization. Finally, (d) we revisit existing reports of strings memorized by recollection that neither pose a privacy threat nor are contextually or counterfactually memorized.
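To make the informal definition above concrete, here is a schematic formalization; the notation (R_trained, R_ctx*) is ours and is not taken from the paper.

```latex
% Schematic formalization of contextual memorization (notation assumed, not the paper's).
% R_trained(s): recollection of string s by the model after training.
% R_ctx*(s):    optimal contextual recollection of s, the best recollection
%               achievable from context alone, without training on s.
\[
  \mathrm{mem}_{\mathrm{ctx}}(s) \;=\; \max\!\bigl(0,\; R_{\mathrm{trained}}(s) - R^{*}_{\mathrm{ctx}}(s)\bigr),
  \qquad
  s \ \text{is contextually memorized} \iff \mathrm{mem}_{\mathrm{ctx}}(s) > 0.
\]
```

Under this reading, a string whose high recollection is fully explained by context, i.e. R_trained(s) <= R_ctx*(s), is never flagged; this is precisely the false positive that purely recollection-based measures incur.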
Problem

Research questions and friction points this paper is trying to address.

Examining whether memorization can be avoided in optimal language learning
Assessing privacy threats posed by memorization in large language models
Comparing memorization measures and their outcomes in diverse LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces contextual memorization for LLMs
Re-examines existing memorization measures critically
Empirically tests memorization across 18 LLMs from 6 families