🤖 AI Summary
This work investigates the security of "machine unlearning", a family of techniques intended to remove hazardous capabilities from large language models (LLMs). We demonstrate that state-of-the-art unlearning methods such as RMU are highly susceptible to adversarial bypasses and offer security guarantees no stronger than conventional safety fine-tuning. First, we show that existing jailbreak techniques, previously reported as ineffective against unlearning, can be adapted to recover supposedly unlearned hazardous capabilities. Second, we propose two adaptive recovery methods: (i) targeted removal of specific directions in the model's activation space, and (ii) fine-tuning on as few as ten unrelated examples. Both approaches efficiently restore most unlearned capabilities in RMU-edited models, exposing severe robustness gaps in current unlearning mechanisms. The study provides a systematic adversarial evaluation of LLM unlearning, along with empirical evidence of its fragility, motivating the development of truly attack-resilient unlearning techniques.
📝 Abstract
Large language models are fine-tuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods instead aim to remove hazardous capabilities from models entirely, making them inaccessible to adversaries. This work challenges the fundamental differences between unlearning and traditional safety post-training from an adversarial perspective. We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can succeed when applied carefully. Furthermore, we develop a variety of adaptive methods that recover most supposedly unlearned capabilities. For instance, we show that fine-tuning on 10 unrelated examples, or removing specific directions in the activation space, can recover most hazardous capabilities from models edited with RMU, a state-of-the-art unlearning method. Our findings challenge the robustness of current unlearning approaches and question their advantages over safety training.
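The "removing specific directions in the activation space" attack can be sketched in a few lines of linear algebra. This is a minimal illustration, not the paper's exact procedure: the helper `remove_direction` and the idea of obtaining the direction (e.g., from a difference of mean activations) are assumptions for the example; in practice the projection would be applied to a model's residual-stream activations at inference time.

```python
import numpy as np

def remove_direction(activations: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project out a single direction from each activation vector.

    activations: (n, d) matrix of activation vectors.
    direction:   (d,) candidate "unlearning" direction; how it is found
                 (e.g., a mean-difference of activations) is an assumption here.
    """
    unit = direction / np.linalg.norm(direction)          # normalize to a unit vector
    # a' = a - (a . u) u  removes each vector's component along the direction
    return activations - np.outer(activations @ unit, unit)

# Toy data standing in for residual-stream activations.
acts = np.random.randn(5, 8)
dir_vec = np.random.randn(8)
cleaned = remove_direction(acts, dir_vec)
```

After the projection, every activation vector has zero component along the removed direction, while all orthogonal components (and hence the rest of the model's behavior) are untouched; the attack's premise is that unlearning methods like RMU concentrate their edit in a small number of such directions.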