Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization

📅 2025-06-14
🤖 AI Summary
Problem: Even after safety fine-tuning, large language models (LLMs) retain hazardous knowledge that can be recovered, and existing unlearning methods are easily reversed. Method: The paper systematically evaluates existing and novel components of unlearning methods and combines the most useful ones into MUDMAN. Its core is *disruption masking*: a weight is updated only where the signs of the unlearning gradient and the retaining gradient agree. This is combined with normalization of the unlearning gradients and a meta-learning procedure. Contribution/Results: In evaluations of how well unlearning prevents the recovery of dangerous capabilities, MUDMAN outperforms the prior TAR method by 40%, setting a new state of the art for robust unlearning.

📝 Abstract
Language models can retain dangerous knowledge and skills even after extensive safety fine-tuning, posing both misuse and misalignment risks. Recent studies show that even specialized unlearning methods can be easily reversed. To address this, we systematically evaluate many existing and novel components of unlearning methods and identify ones crucial for irreversible unlearning. We introduce Disruption Masking, a technique in which we only allow updating weights where the signs of the unlearning gradient and the retaining gradient are the same. This ensures all updates are non-disruptive. Additionally, we identify the need for normalizing the unlearning gradients and confirm the usefulness of meta-learning. We combine these insights into MUDMAN (Meta-Unlearning with Disruption Masking and Normalization) and validate its effectiveness at preventing the recovery of dangerous capabilities. MUDMAN outperforms the prior TAR method by 40%, setting a new state-of-the-art for robust unlearning.
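
The sign-agreement rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that both gradients are taken with respect to losses being minimized, that weights and gradients are flat lists of floats, and that a zero gradient entry counts as "no agreement". The names `disruption_mask` and `masked_update` and the learning rate `lr` are hypothetical.

```python
def disruption_mask(unlearn_grad, retain_grad):
    """Return 1.0 where both gradients have the same strict sign, else 0.0.

    Assumption: an entry where either gradient is exactly zero is masked out.
    """
    return [1.0 if u * r > 0 else 0.0 for u, r in zip(unlearn_grad, retain_grad)]


def masked_update(weights, unlearn_grad, retain_grad, lr=0.1):
    """Apply the unlearning gradient only at positions allowed by the mask."""
    mask = disruption_mask(unlearn_grad, retain_grad)
    return [w - lr * m * u for w, m, u in zip(weights, mask, unlearn_grad)]


# Toy example: only the first weight has agreeing gradient signs.
weights = [1.0, 1.0, 1.0]
unlearn_grad = [0.5, -0.5, 0.5]
retain_grad = [0.2, 0.3, 0.0]
new_weights = masked_update(weights, unlearn_grad, retain_grad)
```

Under these assumptions, a descent step along the unlearning gradient at an unmasked position also moves along the retaining gradient's descent direction, which is one way to read the claim that all updates are non-disruptive.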
Problem

Research questions and friction points this paper is trying to address.

Preventing recovery of dangerous knowledge in language models
Ensuring irreversible unlearning with non-disruptive updates
Improving robustness of unlearning methods beyond prior techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disruption Masking prevents disruptive weight updates
Normalization of unlearning gradients enhances effectiveness
Meta-learning integration improves irreversible unlearning
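
The summary does not say how the unlearning gradients are normalized. As one hedged illustration, the sketch below rescales a gradient to unit L2 norm, so that the size of an unlearning step does not depend on the raw gradient scale. The choice of L2 norm, the function name `normalize`, and the `eps` safeguard are assumptions, and meta-learning is not covered here.

```python
import math


def normalize(grad, eps=1e-12):
    """Rescale a gradient (flat list of floats) to roughly unit L2 norm.

    Assumption: L2 normalization over the whole vector; eps avoids division
    by zero when the gradient is all zeros.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    return [g / (norm + eps) for g in grad]


# Two gradients that differ only in scale give (nearly) the same direction.
small = normalize([3.0, 4.0])
large = normalize([300.0, 400.0])
```

A normalized gradient could then be passed through a sign-agreement mask before the weight update. The order of these two steps is not specified in this summary.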
Authors

Filip Sondej, Jagiellonian University
Yushi Yang, Stanford University
Mikolaj Kniejski, University of Warsaw
Marcel Windys, Independent