🤖 AI Summary
This paper examines a pervasive fragility in post-training modifications of language models, including fine-tuning, alignment, and unlearning, whose behavioral and representational changes are readily circumvented via prompt engineering or adversarial relearning. Motivated by the hypothesis that post-training induces shallow, context-dependent suppression circuits, the authors propose Layered Unlearning (LU), a staged unlearning algorithm: the data is split into $k$ folds, and at stage $i$ the first $i$ folds are unlearned while the remaining $k - i$ are retained. This creates distinct inhibitory mechanisms for a growing subset of the data, limiting the ability of relearning on one subset to recover behavior on the full dataset. LU integrates with mainstream unlearning algorithms and is evaluated on both synthetic tasks and large language models. Experiments show that LU substantially improves the robustness of several unlearning methods against adversarial relearning, and the results offer interpretable insight into how post-training updates suppress, rather than remove, knowledge in neural networks.
📝 Abstract
Our goal is to understand how post-training methods, such as fine-tuning, alignment, and unlearning, modify language model behavior and representations. We are particularly interested in the brittle nature of these modifications, which makes them easy to bypass through prompt engineering or relearning. Recent results suggest that post-training induces shallow, context-dependent "circuits" that suppress specific response patterns. This could be one explanation for the brittleness of post-training. To test this hypothesis, we design an unlearning algorithm, Layered Unlearning (LU), that creates distinct inhibitory mechanisms for a growing subset of the data. By unlearning the first $i$ folds while retaining the remaining $k - i$ at the $i$th of $k$ stages, LU limits the ability of relearning on a subset of the data to recover the full dataset. We evaluate LU through a combination of synthetic and large language model (LLM) experiments. We find that LU improves robustness to adversarial relearning for several different unlearning methods. Our results advance the state of the art in machine unlearning and provide insight into the effect of post-training updates.
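The staged fold schedule described in the abstract can be sketched concretely. The following is a minimal illustration, not the paper's implementation: `layered_unlearning_schedule` is a hypothetical helper that, given $k$ data folds, produces the forget/retain split for each of the $k$ stages, where stage $i$ unlearns folds $1..i$ while retaining folds $i{+}1..k$. The actual unlearning step applied at each stage (e.g., a gradient-based unlearning method) is abstracted away.

```python
def layered_unlearning_schedule(folds):
    """Yield the (forget, retain) fold split for each stage of LU.

    At stage i (1-indexed), folds 1..i form the forget set and folds
    i+1..k form the retain set, so each stage installs a distinct
    inhibitory mechanism for a growing subset of the data.
    """
    k = len(folds)
    schedule = []
    for i in range(1, k + 1):
        forget = folds[:i]   # growing forget set
        retain = folds[i:]   # shrinking retain set
        schedule.append((forget, retain))
    return schedule


# Example with k = 3 folds (fold names are illustrative placeholders):
for forget, retain in layered_unlearning_schedule(["D1", "D2", "D3"]):
    # Each stage would run one pass of the chosen unlearning method
    # with `forget` as the forget set and `retain` as the retain set.
    print(forget, retain)
```

At the final stage the retain set is empty and the entire target dataset has been unlearned, but the intermediate stages are what force relearning on any single fold to contend with mechanisms trained while other folds were still being retained.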