A Closer Look at Machine Unlearning for Large Language Models

📅 2024-10-10
🏛️ arXiv.org
📈 Citations: 1
Influential: 0

🤖 AI Summary
Large language models (LLMs) may inadvertently memorize sensitive or copyright-protected content, posing privacy and legal risks, yet retraining from scratch to remove it is prohibitively expensive, which motivates efficient machine unlearning methods. This paper examines both untargeted and targeted unlearning of specific content from LLMs. It introduces three additional evaluation metrics covering token diversity, sentence semantics, and factual correctness. For untargeted unlearning, it proposes a maximum-entropy (ME) objective; for targeted unlearning, it adds an answer preservation (AP) loss as regularization, because existing regularization is insufficient. The approach is evaluated in three scenarios: fictitious unlearning, continual unlearning, and real-world unlearning. The reported results indicate that both objectives are effective across these scenarios. The implementation is publicly available.

📝 Abstract
Large language models (LLMs) may memorize sensitive or copyrighted content, raising privacy and legal concerns. Due to the high cost of retraining from scratch, researchers attempt to employ machine unlearning to remove specific content from LLMs while preserving the overall performance. In this paper, we discuss several issues in machine unlearning for LLMs and provide our insights on possible approaches. To address the issue of inadequate evaluation of model outputs after unlearning, we introduce three additional metrics to evaluate token diversity, sentence semantics, and factual correctness. We then categorize unlearning methods into untargeted and targeted, and discuss their issues respectively. Specifically, the behavior that untargeted unlearning attempts to approximate is unpredictable and may involve hallucinations, and existing regularization is insufficient for targeted unlearning. To alleviate these issues, we propose using the objective of maximizing entropy (ME) for untargeted unlearning and incorporate answer preservation (AP) loss as regularization for targeted unlearning. Experimental results across three scenarios, i.e., fictitious unlearning, continual unlearning, and real-world unlearning, demonstrate the effectiveness of our approaches. The code is available at https://github.com/sail-sg/closer-look-LLM-unlearning.
Problem

Research questions and friction points this paper is trying to address.

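LLMs may memorize sensitive or copyrighted content, and retraining from scratch to remove it is too costly.
Existing evaluation of model outputs after unlearning is inadequate: token diversity, sentence semantics, and factual correctness go unmeasured.
The behavior that untargeted unlearning approximates is unpredictable and may involve hallucinations, while existing regularization is insufficient for targeted unlearning.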
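
To make the token-diversity idea above concrete, here is a minimal sketch of one way such a score could be computed: the Shannon entropy of a response's token frequencies, normalized to the range [0, 1]. The function name `normalized_token_entropy`, the whitespace tokenization, and the normalization by the log of the sequence length are all assumptions made for illustration; the paper's exact metric definition may differ.

```python
import math
from collections import Counter


def normalized_token_entropy(tokens):
    """Shannon entropy of token frequencies, scaled to [0, 1].

    Hypothetical helper for illustration only; not the paper's code.
    A score near 1 means almost every token is distinct, and a score
    near 0 means the output repeats the same token over and over.
    """
    n = len(tokens)
    if n <= 1:
        return 0.0
    counts = Counter(tokens)
    # Each token type contributes p * log2(1 / p), with p = count / n.
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    # log2(n) is the maximum entropy, reached when all n tokens differ.
    return entropy / math.log2(n)


# Whitespace splitting stands in for a real tokenizer (assumption).
diverse = "the author was born in a small coastal town".split()
repetitive = "the the the the the the the the the".split()

print(normalized_token_entropy(diverse))
print(normalized_token_entropy(repetitive))
```

A degenerate, repetitive response scores low under this kind of measure, which is the failure mode a token-diversity metric is meant to expose after unlearning.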
Innovation

Methods, ideas, or system contributions that make the work stand out.

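Introduces three additional metrics that evaluate the token diversity, sentence semantics, and factual correctness of model outputs after unlearning.
Proposes maximizing entropy (ME) as the objective for untargeted unlearning.
Incorporates an answer preservation (AP) loss as regularization for targeted unlearning.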
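
The entropy-maximization idea can be illustrated with a small, framework-free sketch. It relies on the identity KL(p || U) = log V - H(p), where U is the uniform distribution over a vocabulary of size V, so minimizing the KL divergence to uniform is the same as maximizing the entropy H(p). The function names, the toy logits, and the choice to average over token positions are assumptions for illustration; this is not the authors' implementation, and the exact form of their loss may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy H(p) in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def kl_to_uniform(probs):
    """KL(p || U), with U uniform over len(probs) outcomes."""
    v = len(probs)
    return sum(p * math.log(p * v) for p in probs if p > 0.0)


def max_entropy_loss(token_logits):
    """Mean KL(p_t || U) over token positions (hypothetical helper).

    Driving this value toward zero pushes every next-token
    distribution toward uniform, i.e. toward maximum entropy.
    """
    per_position = [kl_to_uniform(softmax(row)) for row in token_logits]
    return sum(per_position) / len(per_position)


# Toy logits over a 4-token vocabulary (made-up numbers).
confident = [[8.0, 0.0, 0.0, 0.0], [0.0, 6.0, 0.0, 0.0]]
flat = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]

print(max_entropy_loss(confident))  # large: the model is still confident
print(max_entropy_loss(flat))       # zero: predictions are already uniform
```

In a real training loop the same quantity would be computed on the forget-set answers with an autodiff framework and minimized by gradient descent, usually alongside a regularization term on retained data.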