🤖 AI Summary
Large language models (LLMs) risk privacy leakage and copyright infringement because they memorize training data excessively, motivating architectural approaches that decouple memorization from generalization. Grounded in architectural attribution analysis, this work presents the first systematic evidence of a functional separation in Transformers: attention mechanisms in deep layers predominantly drive memorization, while shallow layers support reasoning. We propose an attention-bypass intervention with a theoretical guarantee: we prove that the output perturbation induced by the intervention is bounded. Combining model attribution analysis, selective layer bypassing, layer-wise output-difference modeling, and empirical validation on Pythia and GPT-Neo models across five benchmark datasets, the approach reduces the memorization rate by 37% while keeping downstream task accuracy above 98% and preserving generalization performance. The result is a plug-and-play, architecture-level method for controllable memorization suppression with formal theoretical backing.
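To make the intervention concrete, here is a minimal sketch of what "bypassing attention at specific blocks while keeping LayerNorm and MLP intact" can look like in practice. It is an illustration under our own assumptions, not the authors' released code: the checkpoint (`EleutherAI/pythia-410m`), the chosen layer indices, and the forward-hook mechanism are all hypothetical, standing in for whatever attribution-selected blocks the paper targets.

```python
# Hedged sketch: bypass the attention sub-layer at selected deep blocks of a
# Pythia (GPT-NeoX) model by zeroing its output via a forward hook. The
# residual stream, LayerNorms, and MLPs run unchanged, so each bypassed block
# reduces to x + MLP(LN(x)). Layer indices here are illustrative, not the
# paper's attribution-selected set.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-410m"   # any GPT-NeoX / Pythia checkpoint
BYPASS_LAYERS = [20, 21, 22, 23]   # hypothetical "deep" blocks to bypass

model = AutoModelForCausalLM.from_pretrained(MODEL).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def zero_attention(module, inputs, output):
    # GPT-NeoX attention returns a tuple whose first element is the tensor
    # added to the residual stream; zero only that element.
    if isinstance(output, tuple):
        return (torch.zeros_like(output[0]),) + output[1:]
    return torch.zeros_like(output)

hooks = [
    model.gpt_neox.layers[i].attention.register_forward_hook(zero_attention)
    for i in BYPASS_LAYERS
]

ids = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=20,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0]))

for h in hooks:
    h.remove()  # detach the hooks to restore the original model
```

Because the hooks are attached and removed at inference time, nothing is retrained or overwritten, which is what makes this style of intervention plug-and-play.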
📝 Abstract
Large Language Models (LLMs) are prevalent in modern applications but often memorize training data, leading to privacy breaches and copyright issues. Existing research has mainly focused on post-hoc analyses, such as extracting memorized content or developing memorization metrics, without exploring the underlying architectural factors that contribute to memorization. In this work, we investigate memorization through an architectural lens by analyzing how attention modules at different layers affect a model's memorization and generalization performance. Using attribution techniques, we systematically intervene in the LLM architecture by bypassing attention modules at specific blocks while keeping other components, such as layer normalization and MLP transformations, intact. We provide theorems that analyze our intervention mechanism mathematically, bounding the difference between layer outputs with and without the intervention. Our theoretical and empirical analyses reveal that attention modules in deeper transformer blocks are primarily responsible for memorization, whereas earlier blocks are crucial for the model's generalization and reasoning capabilities. We validate our findings through comprehensive experiments on two LLM families (Pythia and GPT-Neo) and five benchmark datasets. Our insights offer a practical approach to mitigating memorization in LLMs while preserving their performance, contributing to safer and more ethical deployment in real-world applications.
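The abstract states the bounding theorems only informally. As a hedged sketch of how such a bound can arise (our own notation and Lipschitz assumption, not the paper's exact statement): model block $k$ as a residual update $h_{k+1} = h_k + f_k(h_k)$, where $f_k$ combines the attention and MLP sub-layers and is $L_k$-Lipschitz. Bypassing attention at block $\ell$ perturbs the residual stream by exactly the removed attention term $a_\ell = \mathrm{Attn}_\ell(\mathrm{LN}(h_\ell))$, and since each subsequent block can amplify an input perturbation $\delta$ to at most $(1 + L_k)\,\delta$, the deviation at any later depth $m$ satisfies

$$\|\tilde{h}_m - h_m\| \;\le\; \|a_\ell\| \prod_{k=\ell+1}^{m-1} (1 + L_k).$$

Under this reading, bypassing attention in a deep block injects a bounded perturbation that only the few remaining blocks can amplify, while the computation of all earlier blocks is left untouched.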