🤖 AI Summary
This paper examines a pervasive fragility in post-training modifications of language models, including fine-tuning, alignment, and unlearning, whose behavioral and representational changes are readily circumvented via prompt engineering or adversarial relearning. Motivated by the hypothesis that post-training induces shallow, context-dependent suppression circuits, the authors propose Layered Unlearning (LU), a staged unlearning algorithm: the data is split into $k$ folds, and at stage $i$ the first $i$ folds are unlearned while the remaining $k - i$ are retained. This creates distinct inhibitory mechanisms for a growing subset of the data, limiting the ability of relearning on one subset to recover behavior on the full dataset. LU integrates with mainstream unlearning algorithms and is evaluated on both synthetic tasks and large language models. Experiments show that LU substantially improves the robustness of several unlearning methods against adversarial relearning, and the results offer interpretable insight into how post-training updates suppress, rather than remove, knowledge in neural networks.
📝 Abstract
Our goal is to understand how post-training methods, such as fine-tuning, alignment, and unlearning, modify language model behavior and representations. We are particularly interested in the brittle nature of these modifications, which makes them easy to bypass through prompt engineering or relearning. Recent results suggest that post-training induces shallow, context-dependent "circuits" that suppress specific response patterns. This could be one explanation for the brittleness of post-training. To test this hypothesis, we design an unlearning algorithm, Layered Unlearning (LU), that creates distinct inhibitory mechanisms for a growing subset of the data. By unlearning the first $i$ folds while retaining the remaining $k - i$ at the $i$th of $k$ stages, LU limits the ability of relearning on a subset of the data to recover the full dataset. We evaluate LU through a combination of synthetic and large language model (LLM) experiments. We find that LU improves robustness to adversarial relearning for several different unlearning methods. Our results advance the state of the art in machine unlearning and provide insight into the effect of post-training updates.
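The staged fold schedule described in the abstract can be sketched concretely. The following is a minimal illustration, not the paper's implementation: `layered_unlearning_schedule` is a hypothetical helper that, given $k$ data folds, produces the forget/retain split for each of the $k$ stages, where stage $i$ unlearns folds $1..i$ while retaining folds $i{+}1..k$. The actual unlearning step applied at each stage (e.g., a gradient-based unlearning method) is abstracted away.

```python
def layered_unlearning_schedule(folds):
    """Yield the (forget, retain) fold split for each stage of LU.

    At stage i (1-indexed), folds 1..i form the forget set and folds
    i+1..k form the retain set, so each stage installs a distinct
    inhibitory mechanism for a growing subset of the data.
    """
    k = len(folds)
    schedule = []
    for i in range(1, k + 1):
        forget = folds[:i]   # growing forget set
        retain = folds[i:]   # shrinking retain set
        schedule.append((forget, retain))
    return schedule


# Example with k = 3 folds (fold names are illustrative placeholders):
for forget, retain in layered_unlearning_schedule(["D1", "D2", "D3"]):
    # Each stage would run one pass of the chosen unlearning method
    # with `forget` as the forget set and `retain` as the retain set.
    print(forget, retain)
```

At the final stage the retain set is empty and the entire target dataset has been unlearned, but the intermediate stages are what force relearning on any single fold to contend with mechanisms trained while other folds were still being retained.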