Layered Unlearning for Adversarial Relearning

📅 2025-05-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper examines a pervasive fragility in post-training modifications of language models, including fine-tuning, alignment, and unlearning: the behavioral and representational changes they induce are readily circumvented via prompt engineering or adversarial relearning. To address this, the authors propose Layered Unlearning (LU), a staged suppression mechanism: at each of k stages it unlearns a growing subset of the data folds while retaining the rest, creating distinct inhibitory mechanisms that limit the ability of relearning on a subset of the data to restore behavior on the full dataset. LU is motivated by the hypothesis that post-training induces shallow, context-dependent inhibitory circuits, integrates with mainstream unlearning algorithms, and is validated on both synthetic tasks and large language models. Experiments demonstrate that LU substantially improves the robustness of several unlearning methods against adversarial relearning, and the results offer interpretable insight into how knowledge is retained and removed in neural networks.

📝 Abstract
Our goal is to understand how post-training methods, such as fine-tuning, alignment, and unlearning, modify language model behavior and representations. We are particularly interested in the brittle nature of these modifications that makes them easy to bypass through prompt engineering or relearning. Recent results suggest that post-training induces shallow context-dependent "circuits" that suppress specific response patterns. This could be one explanation for the brittleness of post-training. To test this hypothesis, we design an unlearning algorithm, Layered Unlearning (LU), that creates distinct inhibitory mechanisms for a growing subset of the data. By unlearning the first $i$ folds while retaining the remaining $k - i$ at the $i$th of $k$ stages, LU limits the ability of relearning on a subset of data to recover the full dataset. We evaluate LU through a combination of synthetic and large language model (LLM) experiments. We find that LU improves robustness to adversarial relearning for several different unlearning methods. Our results contribute to the state-of-the-art of machine unlearning and provide insight into the effect of post-training updates.
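The staging schedule described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `layered_unlearning_schedule` is a hypothetical helper that only enumerates which folds are forgotten versus retained at each stage; the actual unlearning update applied per stage (e.g., gradient ascent on the forget folds with a retain-set loss) is left abstract.

```python
def layered_unlearning_schedule(k):
    """Enumerate the (forget, retain) fold splits for k LU stages.

    At stage i (1-indexed), the first i folds are unlearned while the
    remaining k - i folds are retained, so each fold acquires its own
    inhibitory mechanism at a distinct stage.
    """
    stages = []
    for i in range(1, k + 1):
        forget = list(range(1, i + 1))      # first i folds
        retain = list(range(i + 1, k + 1))  # remaining k - i folds
        stages.append((forget, retain))
    return stages


# For k = 3 folds, the stages grow the forget set one fold at a time:
for forget, retain in layered_unlearning_schedule(3):
    print(f"forget {forget}, retain {retain}")
```

Because the forget set grows monotonically across stages, relearning on any single fold faces suppression mechanisms installed at later stages that it never sees, which is the property the paper credits for LU's robustness.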
Problem

Research questions and friction points this paper is trying to address.

Understand how post-training modifies model behavior and representations
Test brittleness of post-training modifications via adversarial relearning
Develop Layered Unlearning to improve robustness against relearning attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layered Unlearning creates distinct inhibitory mechanisms
Sequentially unlearning data folds while retaining the rest enhances robustness
Improves adversarial relearning resistance in LLMs
Timothy Qian
MIT
Vinith Suriyakumar
MIT
Ashia Wilson
MIT
Dylan Hadfield-Menell
MIT
Artificial Intelligence