Erase or Hide? Suppressing Spurious Unlearning Neurons for Robust Unlearning

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) are prone to memorizing sensitive data during training, and existing unlearning methods often suffer from "spurious neuron masking": target knowledge is merely hidden rather than erased, so it can be reacquired during subsequent training, leaving persistent privacy risks. This paper proposes Ssiuu, an attribution-guided unlearning method that explicitly suppresses negative attribution paths for target knowledge via neuron influence modeling and targeted regularization, enabling genuine, robust erasure of sensitive information. Unlike shallow alignment-based strategies, Ssiuu operates at the causal attribution level to disrupt the representational recurrence of unwanted knowledge. Experiments show that Ssiuu significantly outperforms strong baselines under both adversarial data injection and benign retraining on an instruction-following benchmark. It achieves more durable and reliable unlearning, substantially mitigating unlearning reversal, the phenomenon where erased knowledge resurfaces after further training.

📝 Abstract
Large language models trained on web-scale data can memorize private or sensitive knowledge, raising significant privacy risks. Although some unlearning methods mitigate these risks, they remain vulnerable to "relearning" during subsequent training, allowing a substantial portion of forgotten knowledge to resurface. In this paper, we show that widely used unlearning methods cause shallow alignment: instead of faithfully erasing target knowledge, they generate spurious unlearning neurons that amplify negative influence to hide it. To overcome this limitation, we introduce Ssiuu, a new class of unlearning methods that employs attribution-guided regularization to prevent spurious negative influence and faithfully remove target knowledge. Experimental results confirm that our method reliably erases target knowledge and outperforms strong baselines across two practical retraining scenarios: (1) adversarial injection of private data, and (2) benign attack using an instruction-following benchmark. Our findings highlight the necessity of robust and faithful unlearning methods for safe deployment of language models.
Problem

Research questions and friction points this paper is trying to address.

Preventing relearning of forgotten private knowledge in language models
Eliminating spurious unlearning neurons that hide target knowledge
Developing robust unlearning methods resistant to adversarial data injection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attribution-guided regularization prevents spurious negative influence
Faithfully removes target knowledge instead of hiding it
Reliably erases knowledge across adversarial retraining scenarios
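The core idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the attribution estimator (activation times gradient) and the function names are assumptions chosen for clarity, and Ssiuu's actual objective may differ.

```python
def neuron_attribution(activations, grads):
    # First-order attribution per neuron: activation times output gradient.
    # (A common approximation; the paper's exact estimator is not given here.)
    return [a * g for a, g in zip(activations, grads)]

def spurious_negative_penalty(attributions, lam=1.0):
    # Hypothetical regularizer penalizing large *negative* attributions,
    # which correspond to "spurious unlearning neurons" that hide target
    # knowledge by amplifying negative influence instead of erasing it.
    return lam * sum(min(a, 0.0) ** 2 for a in attributions)

# Toy example: one layer's activations and gradients for a target fact.
acts = [0.5, -1.2, 2.0]
grads = [1.0, 2.0, -0.5]
attr = neuron_attribution(acts, grads)      # [0.5, -2.4, -1.0]
reg = spurious_negative_penalty(attr)       # (-2.4)**2 + (-1.0)**2 = 6.76
```

In this reading, the penalty is added to the unlearning loss so that optimization cannot satisfy the forgetting objective by driving a few neurons to strongly negative influence, which is the shallow-alignment failure mode the paper identifies.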