🤖 AI Summary
Large language models (LLMs) can inadvertently memorize and leak sensitive training data such as email addresses, URLs, and social security numbers (SSNs). To mitigate this privacy risk, the paper proposes REVS, a gradient-free unlearning method based on rank editing in the vocabulary space. It analyzes token ranks in the vocabulary space to identify and edit the small subset of neurons responsible for representing sensitive tokens, enabling fine-grained, verifiable forgetting. The evaluation covers three datasets: email addresses and URLs that the models memorized naturally, and synthetic SSNs that the models were tuned to memorize. Reported results show an average forgetting rate of 98.3%, an extraction attack success rate reduced to below 2.1%, and only a 0.7% degradation in downstream task performance, significantly outperforming existing data sanitization and model editing techniques. The method thus erases sensitive information efficiently and robustly while preserving overall model integrity.
📝 Abstract
Language models (LMs) risk inadvertently memorizing and divulging sensitive or personally identifiable information (PII) seen in training data, raising privacy concerns. Current approaches to this issue involve either costly dataset scrubbing or model filtering through unlearning and model editing, which can be bypassed through extraction attacks. We propose REVS, a novel non-gradient-based method for unlearning sensitive information from LMs. REVS identifies and modifies a small subset of neurons relevant to the constituent tokens that form sensitive information. To adequately evaluate our method on truly sensitive information, we curate three datasets: email and URL datasets naturally memorized by the models, and a synthetic social security number dataset that we tune the models to memorize. Compared to other methods, REVS demonstrates superior performance in unlearning sensitive information and robustness to extraction attacks, while retaining underlying model integrity.
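The idea of editing a neuron without gradients so that a sensitive token is demoted can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a neuron can be scored against every vocabulary token through a dot product with an unembedding matrix, and it uses a simple iterative shift as a stand-in for the actual editing rule. The names `project_to_vocab`, `token_rank`, `demote_token`, `min_rank`, and `step` are hypothetical, and the step of selecting which neurons to edit is omitted.

```python
def project_to_vocab(neuron, unembed):
    """Score each vocabulary token against one neuron vector (dot products)."""
    return [sum(w * x for w, x in zip(row, neuron)) for row in unembed]


def token_rank(scores, token_id):
    """Rank 0 means the token has the highest score in the projection."""
    target = scores[token_id]
    return sum(1 for s in scores if s > target)


def demote_token(neuron, unembed, token_id, min_rank, step=0.1, max_iters=10_000):
    """Gradient-free edit (illustrative assumption, not the paper's rule):
    shift the neuron against the token's unembedding row until the token's
    rank in the vocabulary projection is at least min_rank."""
    row = unembed[token_id]
    norm_sq = sum(w * w for w in row)
    edited = list(neuron)
    # max_iters caps the loop: the rank target may be unreachable when other
    # rows are strongly aligned with the target token's row.
    for _ in range(max_iters):
        scores = project_to_vocab(edited, unembed)
        if token_rank(scores, token_id) >= min_rank:
            break
        # Each step lowers the target token's score by exactly `step`.
        edited = [x - step * w / norm_sq for x, w in zip(edited, row)]
    return edited


# Toy setup: 5 tokens with one-hot unembedding rows (assumed for clarity).
unembed = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
neuron = [3.0, 1.0, 2.0, 0.5, 0.0]  # token 0 is the "sensitive" token

rank_before = token_rank(project_to_vocab(neuron, unembed), 0)
edited = demote_token(neuron, unembed, token_id=0, min_rank=3)
rank_after = token_rank(project_to_vocab(edited, unembed), 0)
print(rank_before, rank_after)
```

In a real model the unembedding rows are not orthogonal, so an edit aimed at one token also shifts the scores of others. That is one reason a method of this kind has to be checked against both extraction attacks and general model performance, as the abstract describes.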