📝 Abstract
Large language models (LLMs) are commonly evaluated on challenging benchmarks such as AIME and Math500, which are susceptible to data contamination and the risk of being memorized. Existing detection methods, which rely primarily on surface-level lexical overlap and perplexity, generalize poorly and degrade significantly on implicitly contaminated data. In this paper, we propose MemLens (an Activation Lens for Memorization Detection), which detects memorization by analyzing the probability trajectories of numeric tokens during generation. Our method reveals that contaminated samples exhibit "shortcut" behavior, locking onto an answer with high confidence in the model's early layers, whereas clean samples accumulate evidence more gradually across the model's full depth. As a result, contaminated and clean samples produce distinct, well-separated reasoning trajectories. To validate this further, we inject carefully designed samples into the model via LoRA fine-tuning and observe the same trajectory patterns as in naturally contaminated data. These results provide strong evidence that MemLens captures genuine signals of memorization rather than spurious correlations.
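The core idea — projecting intermediate-layer activations onto the vocabulary and tracking when the answer token's probability "locks in" — can be illustrated with a minimal logit-lens-style sketch. This is a toy illustration, not the paper's implementation: the function names (`answer_token_trajectory`, `early_lock_score`), the 0.5 threshold, and the synthetic 12-layer, 4-token-vocab setup are all assumptions made for the example.

```python
import numpy as np

def answer_token_trajectory(hidden_states, W_U, answer_id):
    """Project each layer's hidden state through the unembedding matrix
    (logit-lens style) and record the softmax probability of the answer token."""
    probs = []
    for h in hidden_states:               # one residual-stream vector per layer
        logits = h @ W_U                  # shape: (vocab,)
        p = np.exp(logits - logits.max()) # numerically stable softmax
        p /= p.sum()
        probs.append(p[answer_id])
    return np.array(probs)

def early_lock_score(trajectory, threshold=0.5):
    """Depth remaining after the answer probability first crosses the threshold;
    a high score means the answer was 'locked in' at an early layer."""
    above = np.nonzero(trajectory >= threshold)[0]
    if len(above) == 0:
        return 0.0
    return 1.0 - above[0] / (len(trajectory) - 1)

# Toy demo: 12 layers, 4-token vocab, identity unembedding, answer token id 2.
# A "contaminated" run commits to the answer direction at layer 2; a "clean"
# run only does so in the last two layers.
L, W_U, answer_id = 12, np.eye(4), 2
direction = 5.0 * np.eye(4)[answer_id]    # residual direction boosting the answer logit
contaminated = [direction if l >= 2 else np.zeros(4) for l in range(L)]
clean = [direction if l >= L - 2 else np.zeros(4) for l in range(L)]

score_contaminated = early_lock_score(answer_token_trajectory(contaminated, W_U, answer_id))
score_clean = early_lock_score(answer_token_trajectory(clean, W_U, answer_id))
```

On this synthetic data the contaminated trajectory crosses the threshold at layer 2 (score ≈ 0.82) while the clean one crosses only at layer 10 (score ≈ 0.09), mimicking the early-locking vs. gradual-accumulation separation the abstract describes; with a real model, the per-layer hidden states would come from the transformer's residual stream and `W_U` from its unembedding matrix.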