REVS: Unlearning Sensitive Information in Language Models via Rank Editing in the Vocabulary Space

📅 2024-06-13

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Large language models (LLMs) can inadvertently memorize and leak sensitive training data such as email addresses, URLs, and social security numbers (SSNs). To mitigate this privacy risk, the paper proposes REVS, a gradient-free unlearning method based on rank editing in the vocabulary space. REVS identifies a small subset of neurons relevant to the constituent tokens of a sensitive sequence and edits them, enabling fine-grained forgetting. The method is evaluated on three datasets: email and URL datasets naturally memorized by the models, and a synthetic SSN dataset that the models are tuned to memorize. Results show an average forgetting rate of 98.3%, a reduction in extraction attack success rate to below 2.1%, and only a 0.7% degradation in downstream task performance, significantly outperforming existing data sanitization and model editing techniques. The method erases sensitive information efficiently and robustly while preserving overall model integrity.

📝 Abstract

Language models (LMs) risk inadvertently memorizing and divulging sensitive or personally identifiable information (PII) seen in training data, causing privacy concerns. Current approaches to address this issue involve costly dataset scrubbing, or model filtering through unlearning and model editing, which can be bypassed through extraction attacks. We propose REVS, a novel non-gradient-based method for unlearning sensitive information from LMs. REVS identifies and modifies a small subset of neurons relevant for constituent tokens that form sensitive information. To adequately evaluate our method on truly sensitive information, we curate three datasets: email and URL datasets naturally memorized by the models, and a synthetic social security number dataset that we tune the models to memorize. Compared to other methods, REVS demonstrates superior performance in unlearning sensitive information and robustness to extraction attacks, while retaining underlying model integrity.
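The abstract does not spell out the editing step, so the following is only a minimal sketch of what "rank editing in the vocabulary space" could look like, not the authors' implementation. It assumes a neuron can be projected to vocabulary logits with an unembedding matrix, that the target token's logit is lowered until its rank falls to at least `min_rank`, and that the edited logits are mapped back with a least-squares fit. The names `token_rank`, `demote_token`, `unembed`, `min_rank`, `step`, and `max_iters` are hypothetical.

```python
import numpy as np

def token_rank(logits, token_id):
    """Rank of token_id among the logits (0 means highest logit)."""
    return int(np.sum(logits > logits[token_id]))

def demote_token(neuron, unembed, token_id, min_rank, step=1.0, max_iters=50):
    """Best-effort demotion of token_id to rank >= min_rank.

    neuron:  (d,) float vector in hidden space
    unembed: (V, d) matrix mapping hidden space to vocabulary logits (assumed)
    min_rank must be smaller than V.
    """
    edited = np.asarray(neuron, dtype=float).copy()
    for _ in range(max_iters):
        logits = unembed @ edited
        if token_rank(logits, token_id) >= min_rank:
            break  # the token is already ranked low enough
        # Logit currently sitting at position min_rank in descending order.
        threshold = np.sort(logits)[::-1][min_rank]
        target = logits.copy()
        target[token_id] = threshold - step  # push the token below that position
        # Map the edited logits back to hidden space (least-squares fit).
        edited, *_ = np.linalg.lstsq(unembed, target, rcond=None)
    return edited

# Toy usage: identity unembedding, so logits equal the neuron values.
unembed = np.eye(5)
neuron = np.array([3.0, 1.0, 4.0, 1.5, 2.0])
new_neuron = demote_token(neuron, unembed, token_id=2, min_rank=3)
```

When the vocabulary is larger than the hidden dimension, the back-projection is only approximate, which is why the sketch loops and re-checks the rank rather than editing once.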

Problem

Research questions and friction points this paper is trying to address.

Unlearning sensitive information in LMs

Rank editing in vocabulary space

Robustness to extraction attacks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Non-gradient-based unlearning method

Modifies neurons for sensitive tokens

Robust against extraction attacks
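As an illustration of how neurons relevant to a sensitive token might be picked, the sketch below scores each neuron by the rank of the target token in that neuron's vocabulary projection and keeps the `k` best. This selection criterion is an assumption made for the example; the page only states that a small subset of relevant neurons is identified. `select_neurons`, `unembed`, and `k` are hypothetical names.

```python
import numpy as np

def select_neurons(neurons, unembed, token_id, k):
    """Pick the k neurons whose vocabulary projection ranks token_id highest.

    neurons: (n, d) matrix, one neuron per row
    unembed: (V, d) matrix mapping hidden space to vocabulary logits (assumed)
    Returns (indices of the selected neurons, rank of token_id per neuron).
    """
    logits = neurons @ unembed.T                  # (n, V) per-neuron logits
    target = logits[:, token_id][:, None]         # (n, 1) target-token logit
    ranks = (logits > target).sum(axis=1)         # 0 means the token is on top
    order = np.argsort(ranks, kind="stable")      # lowest rank first
    return order[:k], ranks

# Toy usage: identity unembedding, three neurons, four-token vocabulary.
neurons = np.array([
    [0.1, 0.9, 0.2, 0.3],
    [0.8, 0.1, 0.2, 0.3],
    [0.5, 0.6, 0.1, 0.7],
])
selected, ranks = select_neurons(neurons, np.eye(4), token_id=1, k=2)
```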

🔎 Similar Papers

Learn while Unlearn: An Iterative Unlearning Framework for Generative Language Models