Leverage Unlearning to Sanitize LLMs

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address privacy risks arising from large language models (LLMs) memorizing and leaking sensitive information—such as personally identifiable or confidential data—during fine-tuning, this paper proposes a two-stage model sanitization method that requires neither re-pretraining nor auxiliary safety data: *forgetting* (resetting critical neurons) followed by *repairing* (task-aware selective fine-tuning). This is the first approach to integrate neuron-level intervention with task-informed fine-tuning, enabling fine-grained erasure of both direct and indirect identifiers. Experiments on medical and general-purpose LLMs demonstrate that only a few forgetting iterations substantially reduce sensitive information leakage while preserving downstream task performance. The method offers an efficient, low-overhead, and scalable paradigm for LLM memory editing, advancing practical privacy-preserving model adaptation.

📝 Abstract
Pre-trained large language models (LLMs) are becoming useful for various tasks. To improve their performance on certain tasks, they must be fine-tuned on specific data corpora (e.g., medical reports, business data). These specialized corpora may contain sensitive data (e.g., personal or confidential data) that the model memorizes and is likely to regurgitate during subsequent use. This memorization of sensitive information poses a significant privacy or confidentiality issue. To remove this memorization and sanitize the model without requiring costly additional fine-tuning on a secured data corpus, we propose SANI, an unlearning approach to sanitize language models. It relies on erasure and repair phases that 1) reset certain neurons in the last layers of the model to disrupt the memorization of fine-grained information, and then 2) fine-tune the model while avoiding re-memorizing sensitive information. We comprehensively evaluate SANI in two settings: a model fine-tuned and specialized on medical data, from which direct and indirect identifiers are removed, and a standard pre-trained model, from which specific terms defined as confidential information are removed. Results show that with only a few additional epochs of unlearning, the model is sanitized and the number of regurgitations is drastically reduced. This approach can be particularly useful for hospitals or other industries that have already spent significant resources training models on large datasets and wish to sanitize them before sharing.
Problem

Research questions and friction points this paper is trying to address.

Removing sensitive data memorization from fine-tuned LLMs
Sanitizing models without costly secure data retraining
Preventing confidential information regurgitation in specialized models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unlearning approach sanitizes language models
Resets neurons to disrupt memorized information
Fine-tunes model while avoiding sensitive data
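The paper releases no code, but the erase-then-repair idea can be sketched in miniature. The numpy sketch below treats a single last-layer weight matrix as the whole model; ranking neurons by L2 norm in `forget` and restricting the repair update to the reset rows in `repair` are illustrative assumptions, not the paper's actual neuron-selection criterion, and the filtering of sensitive records from the repair corpus is left out.

```python
import numpy as np

def forget(last_layer_weights, frac=0.1, seed=0):
    """Erasure phase (sketch): re-initialize the most active neurons
    (rows) of a last-layer weight matrix. Ranking rows by L2 norm is
    a hypothetical proxy for the paper's criticality criterion."""
    rng = np.random.default_rng(seed)
    W = last_layer_weights.copy()
    k = max(1, int(frac * W.shape[0]))
    # Pick the k rows with the largest norms and reset them to small noise.
    reset_idx = np.argsort(np.linalg.norm(W, axis=1))[-k:]
    W[reset_idx] = rng.normal(0.0, 0.02, size=(k, W.shape[1]))
    return W, reset_idx

def repair(W, grads, reset_idx, lr=0.01):
    """Repair phase (sketch): one SGD step that updates only the reset
    neurons; `grads` would come from a task loss computed on a corpus
    with sensitive records filtered out (data pipeline not shown)."""
    W2 = W.copy()
    W2[reset_idx] -= lr * grads[reset_idx]
    return W2
```

In a real model the two phases would run over the last transformer layers for a few epochs, with the untouched parameters frozen during repair so only the reset neurons relearn the task.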