From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work exposes a critical security vulnerability in unlearning methods: fine-tuning solely on the retained dataset can restore unlearned knowledge, raising classification accuracy on forgotten classes from around 50% back to nearly 100% and rendering standard methods susceptible to relearning attacks. Although motivated by LLM unlearning, the phenomenon is studied in a controlled setting with vision classifiers. The authors first show that two weight-space properties, the L2 distance and the linear mode connectivity between the original and unlearned models, quantitatively predict a model's resistance to such relearning. Leveraging this insight, they propose an unlearning framework based on weight-space regularization, jointly optimizing example-level unlearning objectives with L2 constraints on parameter updates. The method preserves unlearning efficacy while substantially enhancing robustness: in experiments across multiple benchmarks, accuracy on forgotten classes could not be recovered by retain-set fine-tuning.
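The summary's joint objective (an example-level unlearning loss plus an L2 constraint on parameter updates) can be sketched on a toy model. Everything below is an illustrative assumption, not the paper's actual formulation: a logistic model, gradient *ascent* on the forget example's loss as the unlearning objective, and an L2 penalty (weight `lam`) anchoring the weights to the original `theta0`, which bounds how far they can drift.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unlearn_step(theta, theta0, x, y, lam, lr=0.5):
    """One hypothetical update: ascend the forget example's loss while an
    L2 penalty pulls theta back toward the original weights theta0."""
    p = sigmoid(x @ theta)
    grad_forget = (p - y) * x                 # gradient of the NLL on the forget example
    grad_penalty = 2.0 * lam * (theta - theta0)  # pulls theta toward theta0
    # Ascent on the forget loss, descent on the L2 anchor term.
    return theta + lr * grad_forget - lr * grad_penalty

theta0 = np.array([2.0, -1.0])   # "original" model weights (toy values)
x = np.array([1.0, 0.5])         # a single forget example, label 1
theta = theta0.copy()
for _ in range(200):
    theta = unlearn_step(theta, theta0, x, 1.0, lam=0.1)

p_before = sigmoid(x @ theta0)   # forget example confidently classified
p_after = sigmoid(x @ theta)     # confidence driven down by unlearning
drift = np.linalg.norm(theta - theta0)
```

Because the logistic gradient is bounded by `|x|` while the penalty gradient grows linearly with distance, the drift settles below `|x| / (2 * lam)`: the forget example's confidence collapses while the weights stay provably close to the original model.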

📝 Abstract
Recent unlearning methods for LLMs are vulnerable to relearning attacks: knowledge believed-to-be-unlearned re-emerges by fine-tuning on a small set of (even seemingly-unrelated) examples. We study this phenomenon in a controlled setting for example-level unlearning in vision classifiers. We make the surprising discovery that forget-set accuracy can recover from around 50% post-unlearning to nearly 100% with fine-tuning on just the retain set -- i.e., zero examples of the forget set. We observe this effect across a wide variety of unlearning methods, whereas for a model retrained from scratch excluding the forget set (gold standard), the accuracy remains at 50%. We observe that resistance to relearning attacks can be predicted by weight-space properties, specifically, $L_2$-distance and linear mode connectivity between the original and the unlearned model. Leveraging this insight, we propose a new class of methods that achieve state-of-the-art resistance to relearning attacks.
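One of the abstract's predictors, linear mode connectivity, is commonly measured as the loss barrier along the straight line between two weight vectors. The sketch below illustrates that measurement on a toy non-convex "double-well" loss; the loss, the specific models, and the barrier definition (peak loss minus the chord between endpoint losses) are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

c1 = np.zeros(2)             # one minimum of the toy loss
c2 = np.array([4.0, 0.0])    # a second, separated minimum

def toy_loss(theta):
    # Non-convex double-well: squared distance to the nearer minimum.
    return float(min(np.sum((theta - c1) ** 2), np.sum((theta - c2) ** 2)))

def lmc_barrier(theta_a, theta_b, loss_fn, n_points=101):
    """Worst loss along the linear weight interpolation, measured
    relative to the chord between the two endpoint losses."""
    alphas = np.linspace(0.0, 1.0, n_points)
    path = [loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas]
    chord = [(1 - a) * path[0] + a * path[-1] for a in alphas]
    return max(p - c for p, c in zip(path, chord))

theta_orig = c1
theta_near = np.array([0.5, 0.0])  # small L2 drift: no barrier on the path
theta_far = c2                      # large L2 drift: barrier between wells

barrier_near = lmc_barrier(theta_orig, theta_near, toy_loss)
barrier_far = lmc_barrier(theta_orig, theta_far, toy_loss)
```

In this toy picture the nearby model is linearly connected to the original (zero barrier) while the distant model sits behind a loss barrier; the paper's claim is that such weight-space quantities, computed between the original and unlearned models, predict resistance to relearning attacks.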
Problem

Research questions and friction points this paper is trying to address.

Study vulnerability of LLM unlearning to relearning attacks
Investigate forget-set accuracy recovery via retain-set fine-tuning
Propose weight-space regularization for tamper-resistant unlearning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weight-space regularization for tamper-resistant unlearning
L2-distance and linear mode connectivity predict resistance
State-of-the-art resistance to relearning attacks