MemLens: Uncovering Memorization in LLMs with Activation Trajectories

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) are susceptible to training data contamination on benchmarks such as AIME and Math500; existing detection methods, which rely on lexical overlap or perplexity, generalize poorly and fail to identify implicit contamination. To address this, we propose a novel memorization detection paradigm based on **digit-token probability activation trajectories**: contaminated samples exhibit premature answer locking ("shortcut reasoning") in early Transformer layers, whereas clean samples accumulate evidence incrementally across layers, yielding discriminative dynamic trajectories. We further validate that these trajectories causally reflect memorization by injecting controlled contamination samples through LoRA-based fine-tuning. Our method significantly improves detection accuracy for implicit contamination, achieves superior cross-task generalization across multiple benchmarks, and provides an interpretable, empirically verifiable tool for trustworthy LLM evaluation.

📝 Abstract
Large language models (LLMs) are commonly evaluated on challenging benchmarks such as AIME and Math500, which are susceptible to contamination and risk of being memorized. Existing detection methods, which primarily rely on surface-level lexical overlap and perplexity, demonstrate low generalization and degrade significantly when encountering implicitly contaminated data. In this paper, we propose MemLens (An Activation Lens for Memorization Detection) to detect memorization by analyzing the probability trajectories of numeric tokens during generation. Our method reveals that contaminated samples exhibit "shortcut" behaviors, locking onto an answer with high confidence in the model's early layers, whereas clean samples show more gradual evidence accumulation across the model's full depth. We observe that contaminated and clean samples exhibit distinct and well-separated reasoning trajectories. To further validate this, we inject carefully designed samples into the model through LoRA fine-tuning and observe the same trajectory patterns as in naturally contaminated data. These results provide strong evidence that MemLens captures genuine signals of memorization rather than spurious correlations.
Problem

Research questions and friction points this paper is trying to address.

Detecting memorization of contaminated benchmark data in LLMs
Overcoming the limitations of surface-level lexical-overlap methods
Identifying shortcut behaviors in the trajectories of contaminated samples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing numeric token probability trajectories
Detecting shortcut behaviors in early layers
Validating with LoRA fine-tuning injection
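The core idea behind these trajectories can be illustrated with a small sketch. The paper does not publish its exact implementation, so the functions below are hypothetical: a logit-lens-style projection of each layer's hidden state through the unembedding matrix to obtain the probability of a target (digit) token per layer, plus a simple "lock-in layer" heuristic for the premature answer-locking behavior described above.

```python
import numpy as np

def digit_token_trajectory(hidden_states, W_unembed, target_token):
    """Logit-lens-style sketch: for each layer's hidden state, project
    through the unembedding matrix and record the softmax probability
    assigned to the target (digit) token.

    hidden_states: array of shape (num_layers, hidden_dim)
    W_unembed:     array of shape (hidden_dim, vocab_size)
    """
    trajectory = []
    for h in hidden_states:
        logits = h @ W_unembed            # (vocab_size,)
        logits = logits - logits.max()    # numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        trajectory.append(probs[target_token])
    return np.array(trajectory)

def lock_in_layer(trajectory, threshold=0.9):
    """Earliest layer from which the target probability stays above the
    threshold — a proxy for the 'shortcut' answer locking attributed to
    contaminated samples (clean samples would lock late or not at all)."""
    above = trajectory >= threshold
    for i in range(len(above)):
        if above[i:].all():
            return i
    return None
```

A contaminated sample would show a trajectory like `[0.1, 0.8, 0.95, 0.97]` (locking at an early layer), while a clean sample accumulates probability only near the final layers; the lock-in layer, or the trajectory itself, then serves as the discriminative feature.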