Towards Robust and Parameter-Efficient Knowledge Unlearning for LLMs

📅 2024-08-13
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address privacy risks and catastrophic forgetting arising from sensitive-data unlearning in large language models (LLMs), this paper proposes an efficient and robust machine unlearning method. The approach integrates gradient ascent optimization with low-rank adaptation (LoRA) to avoid full-parameter retraining. Its key contributions are: (1) an inverted hinge loss function that suppresses generation of harmful tokens while preserving textual fluency; and (2) a LoRA initialization strategy adaptively weighted by relative Fisher information, enabling precise identification and updating of critical parameters. Evaluated on the TOFU and Training Data Extraction Challenge benchmarks, the method achieves significantly higher sensitive-information removal rates, maintains stable inference and generation capabilities, reduces trainable parameter updates by over 60%, and incurs no noticeable performance degradation.
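The inverted hinge loss described above can be sketched in a few lines: it penalizes the probability of the token to be unlearned while rewarding the most likely alternative token, so minimizing it suppresses the sensitive continuation without flattening the whole distribution. This is a minimal single-position sketch from the summary's description, not the paper's full implementation:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def inverted_hinge_loss(logits, target):
    """Inverted hinge loss for one position:
    1 + p(target) - max over v != target of p(v).
    Minimizing this pushes p(target) down while boosting
    the next most likely (non-target) token, preserving fluency."""
    p = softmax(logits)
    p_target = p[target]
    p_runner_up = np.max(np.delete(p, target))
    return 1.0 + p_target - p_runner_up
```

When the target token is still the model's top prediction the loss exceeds 1; once an alternative token dominates, the loss drops below 1, giving a natural stopping signal for unlearning.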

📝 Abstract
Large Language Models (LLMs) have demonstrated strong reasoning and memorization capabilities via pretraining on massive textual corpora. However, this poses risks of privacy and copyright violations, highlighting the need for efficient machine unlearning methods that remove sensitive data without retraining from scratch. While Gradient Ascent (GA) is commonly used to unlearn by reducing the likelihood of generating unwanted content, it leads to unstable optimization and catastrophic forgetting of retained knowledge. We find that combining GA with low-rank adaptation results in poor trade-offs between computational cost and generative performance. To address these challenges, we propose two novel techniques for robust and efficient unlearning for LLMs. First, we introduce Inverted Hinge Loss, which suppresses unwanted tokens while maintaining fluency by boosting the probability of the next most likely token. Second, we develop a data-adaptive initialization for LoRA adapters via low-rank approximation weighted with relative Fisher information, thereby focusing updates on parameters critical for removing targeted knowledge. Experiments on the Training Data Extraction Challenge dataset using GPT-Neo models as well as on the TOFU benchmark with Phi-1.5B and Llama2-7B models demonstrate that our approach effectively removes sensitive information while maintaining reasoning and generative capabilities with minimal impact. Our implementation can be found at https://github.com/csm9493/efficient-llm-unlearning.
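The gradient-ascent baseline the abstract contrasts against can be illustrated with a toy update: ascending the negative log-likelihood of a forget-set token drives its probability down. This is a hypothetical one-step sketch on raw logits, not the paper's training loop:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def ga_unlearn_step(logits, target, lr=1.0):
    """One gradient-ascent step on the NLL of the forget token.
    d(-log p(target))/d(logits) = p - onehot(target); stepping
    along this direction lowers p(target)."""
    p = softmax(logits)
    grad_nll = p.copy()
    grad_nll[target] -= 1.0
    return logits + lr * grad_nll
```

Repeating such steps without any retention objective is exactly what destabilizes optimization: probability mass leaks away from all tokens sharing parameters with the target, which motivates the paper's two fixes.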
Problem

Research questions and friction points this paper is trying to address.

Develop efficient unlearning methods for LLMs
Address privacy and copyright risks in LLMs
Maintain model performance while removing sensitive data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inverted Hinge Loss suppresses unwanted tokens
Data-adaptive initialization for LoRA adapters
Low-rank approximation with Fisher information
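The bullets above can be tied together with a sketch of the Fisher-weighted LoRA initialization. The function name, the square-root relative-Fisher weighting, and the SVD split are assumptions of this sketch, not the paper's exact formulation: the idea is to weight the frozen weight matrix by how important each entry is for the forget data relative to the retain data, then seed the LoRA factors from its top singular directions:

```python
import numpy as np

def fisher_weighted_lora_init(W, fisher_forget, fisher_retain, rank, eps=1e-8):
    """Sketch of a data-adaptive LoRA init (assumed weighting scheme):
    1. relative Fisher = importance for forgetting vs. retaining;
    2. weight W elementwise by sqrt(relative Fisher);
    3. take the top-`rank` SVD factors as the A/B initialization."""
    rel = np.sqrt(fisher_forget / (fisher_retain + eps))
    U, S, Vt = np.linalg.svd(rel * W, full_matrices=False)
    B = U[:, :rank] * np.sqrt(S[:rank])          # (d_out, rank)
    A = np.sqrt(S[:rank])[:, None] * Vt[:rank]   # (rank, d_in)
    return A, B
```

Initializing this way concentrates the low-rank update on directions that matter most for the targeted knowledge, which is how the method keeps trainable-parameter updates small without sacrificing removal quality.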