Unlearning or Obfuscating? Jogging the Memory of Unlearned LLMs via Benign Relearning

📅 2024-06-19

📈 Citations: 4

✨ Influential: 0

🤖 AI Summary

This work identifies a vulnerability in existing approximate unlearning methods for large language models (LLMs): knowledge that was supposedly removed, such as bioweapon details or verbatim copyrighted text, can be restored through benign relearning attacks that use only a small amount of public, loosely related data (e.g., general wiki information or public medical articles). The authors formalize this unlearning-relearning pipeline and evaluate the attack across three popular unlearning benchmarks. The results indicate that current approximate unlearning methods suppress model outputs rather than robustly forget the target knowledge. The main contributions are: (i) showing that benign relearning can reverse the effects of unlearning; (ii) formalizing the unlearning-relearning pipeline; and (iii) discussing future directions and guidelines that follow from the study.

📝 Abstract

Machine unlearning is a promising approach to mitigate undesirable memorization of training data in ML models. However, in this work we show that existing approaches for unlearning in LLMs are surprisingly susceptible to a simple set of $\textit{benign relearning attacks}$. With access to only a small and potentially loosely related set of data, we find that we can "jog" the memory of unlearned models to reverse the effects of unlearning. For example, we show that relearning on public medical articles can lead an unlearned LLM to output harmful knowledge about bioweapons, and relearning general wiki information about the book series Harry Potter can force the model to output verbatim memorized text. We formalize this unlearning-relearning pipeline, explore the attack across three popular unlearning benchmarks, and discuss future directions and guidelines that result from our study. Our work indicates that current approximate unlearning methods simply suppress the model outputs and fail to robustly forget target knowledge in the LLMs.

Problem

Research questions and friction points this paper is trying to address.

Existing unlearning methods in LLMs are vulnerable to benign relearning attacks.

Relearning on loosely related data can reverse unlearning effects in LLMs.

Current unlearning approaches suppress outputs but fail to robustly forget knowledge.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Explores benign relearning attacks on LLMs

Formalizes the unlearning-relearning pipeline for LLMs

Assesses unlearning robustness across three popular unlearning benchmarks
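
The unlearning-relearning pipeline described in the abstract can be pictured as a three-stage evaluation: score the original model on the forget target, score it again after unlearning, and score it a third time after fine-tuning on benign, loosely related data. The following is a minimal sketch of such a harness, not the authors' implementation. The names `run_pipeline`, `PipelineReport`, and `recovered_fraction` are hypothetical, and the toy model (a dict with a `knowledge` value and a `gate` value) is a stand-in chosen only so the example runs without any ML library. The toy stubs hard-code their behavior and are not evidence about any real unlearning method.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Model = Dict[str, float]  # hypothetical toy "model": named scalar parameters


@dataclass
class PipelineReport:
    """Forget-target scores at the three stages of the pipeline."""
    original: float
    unlearned: float
    relearned: float

    @property
    def recovered_fraction(self) -> float:
        """Share of the score removed by unlearning that relearning restored.

        This metric is an illustrative assumption, not one taken from the paper.
        """
        removed = self.original - self.unlearned
        if removed <= 0:
            return 0.0
        return (self.relearned - self.unlearned) / removed


def run_pipeline(
    model: Model,
    unlearn: Callable[[Model], Model],
    relearn: Callable[[Model], Model],
    score: Callable[[Model], float],
) -> PipelineReport:
    """Score the model before unlearning, after unlearning, and after relearning."""
    original = score(model)
    unlearned_model = unlearn(model)
    unlearned = score(unlearned_model)
    relearned_model = relearn(unlearned_model)
    relearned = score(relearned_model)
    return PipelineReport(original, unlearned, relearned)


# --- Toy stand-ins (assumptions for illustration only) ---------------------

def toy_score(model: Model) -> float:
    # Output on the forget target: stored knowledge scaled by an output gate.
    return model["knowledge"] * model["gate"]


def toy_unlearn(model: Model) -> Model:
    # Stub that closes the output gate but leaves the stored knowledge intact.
    return {**model, "gate": 0.0}


def toy_relearn(model: Model) -> Model:
    # Stub for fine-tuning on benign data: partially reopens the gate.
    return {**model, "gate": 0.5}


report = run_pipeline(
    {"knowledge": 1.0, "gate": 1.0}, toy_unlearn, toy_relearn, toy_score
)
print(report, report.recovered_fraction)
```

In a real study the three callables would wrap an actual unlearning procedure, a fine-tuning run on the benign relearn set, and a benchmark-specific metric on the forget set. Keeping them as plain callables lets the same harness be reused across benchmarks.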
