Distillation Robustifies Unlearning

📅 2025-06-06
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing large language model (LLM) unlearning methods exhibit poor robustness: they are easily reversed by minimal fine-tuning, revealing a fundamental limitation of output-alignment-only forgetting strategies. This paper proposes UNDO, the first framework to systematically demonstrate that knowledge distillation substantially enhances unlearning robustness. UNDO combines noise injection with output-level imitation training, enabling a controllable trade-off between computational cost and robustness. On synthetic tasks, UNDO achieves forgetting performance comparable to full retraining while consuming only 60–80% of its compute and requiring only 0.01% of the pretraining data to be labeled. Its effectiveness is further validated on the real-world WMDP benchmark. Key contributions include: (i) establishing the mechanistic role of distillation in strengthening unlearning robustness; (ii) introducing a lightweight unlearning method that jointly balances efficiency, robustness, and practicality.

๐Ÿ“ Abstract
Current LLM unlearning methods are not robust: they can be reverted easily with a few steps of finetuning. This is true even for the idealized unlearning method of training to imitate an oracle model that was never exposed to unwanted information, suggesting that output-based finetuning is insufficient to achieve robust unlearning. In a similar vein, we find that training a randomly initialized student to imitate an unlearned model transfers desired behaviors while leaving undesired capabilities behind. In other words, distillation robustifies unlearning. Building on this insight, we propose Unlearn-Noise-Distill-on-Outputs (UNDO), a scalable method that distills an unlearned model into a partially noised copy of itself. UNDO introduces a tunable tradeoff between compute cost and robustness, establishing a new Pareto frontier on synthetic language and arithmetic tasks. At its strongest setting, UNDO matches the robustness of a model retrained from scratch with perfect data filtering while using only 60-80% of the compute and requiring only 0.01% of the pretraining data to be labeled. We also show that UNDO robustifies unlearning on the more realistic Weapons of Mass Destruction Proxy (WMDP) benchmark. Since distillation is widely used in practice, incorporating an unlearning step beforehand offers a convenient path to robust capability removal.
Problem

Research questions and friction points this paper is trying to address.

Current LLM unlearning methods lack robustness and are easily reverted
Output-based finetuning alone fails to achieve robust unlearning
Whether distillation can robustify unlearning, motivating the proposed UNDO method
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distillation enhances unlearning robustness
UNDO method balances cost and robustness
Partial noising provides a tunable tradeoff between compute cost and robustness
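To make the UNDO recipe above concrete (start from an unlearned teacher, partially noise a copy of its weights, then distill on outputs), here is a toy sketch on a linear model. The noise fraction `alpha`, the learning rate, and the squared-error distillation loss are illustrative choices for this sketch, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "teacher": stands in for an already-unlearned model, y = W x.
W_teacher = rng.normal(size=(4, 8))

# UNDO-style initialization: a partially noised copy of the teacher.
# alpha sets the noise level, i.e. the compute/robustness dial
# (alpha, lr, and the loss below are illustrative, not the paper's values).
alpha = 0.5
W_student = (1 - alpha) * W_teacher + alpha * rng.normal(size=W_teacher.shape)

# Distill on outputs: regress student outputs onto teacher outputs.
lr = 0.05
for _ in range(500):
    x = rng.normal(size=(32, 8))             # batch of random inputs
    err = W_student @ x.T - W_teacher @ x.T  # output mismatch, shape (4, 32)
    grad = err @ x / x.shape[0]              # MSE gradient w.r.t. W_student
    W_student -= lr * grad

# After distillation the student's outputs match the teacher's,
# even though its weights started far from the teacher's.
print(np.abs(W_student - W_teacher).max())
```

In the paper's setting the teacher is an unlearned LLM and the imitation loss is on token outputs; the point illustrated here is only that a heavily noised student can recover the teacher's behavior from outputs alone.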
Bruce W. Lee
University of Pennsylvania
Addie Foote
ML Alignment & Theory Scholars
Alex Infanger
Graduate Student, Stanford University
Applied probability, Markov chains, numerical linear algebra, scientific computing
Leni Shor
Massachusetts Institute of Technology, ML Alignment & Theory Scholars
Harish Kamath
ML Alignment & Theory Scholars
Jacob Goldman-Wetzler
Brown University, ML Alignment & Theory Scholars
Bryce Woodworth
ML Alignment & Theory Scholars
Alex Cloud
North Carolina State University
Statistics, machine learning
Alexander Matt Turner
Research scientist, Google DeepMind
AI alignment, reinforcement learning