🤖 AI Summary
Large language models (LLMs) can memorize sensitive data during training, and existing unlearning methods often suffer from spurious neuron masking: target knowledge is hidden rather than erased, so it can be reacquired during subsequent training and the privacy risk persists. This paper proposes Ssiuu, an attribution-guided unlearning method that models neuron influence and applies targeted regularization to suppress spurious negative attribution paths, enabling genuine, durable erasure of sensitive information. Unlike shallow alignment-based strategies, Ssiuu intervenes at the attribution level to disrupt the mechanism by which unwanted knowledge recurs in the model's representations. Experiments show that Ssiuu outperforms strong baselines under both adversarial injection of private data and retraining on an instruction-following benchmark, substantially mitigating unlearning reversal, the phenomenon in which erased knowledge resurfaces after further training.
📝 Abstract
Large language models trained on web-scale data can memorize private or sensitive knowledge, raising significant privacy risks. Although some unlearning methods mitigate these risks, they remain vulnerable to "relearning" during subsequent training, allowing a substantial portion of forgotten knowledge to resurface. In this paper, we show that widely used unlearning methods cause shallow alignment: instead of faithfully erasing target knowledge, they generate spurious unlearning neurons that amplify negative influence to hide it. To overcome this limitation, we introduce Ssiuu, a new class of unlearning methods that employs attribution-guided regularization to prevent spurious negative influence and faithfully remove target knowledge. Experimental results confirm that our method reliably erases target knowledge and outperforms strong baselines across two practical retraining scenarios: (1) adversarial injection of private data, and (2) benign attack using an instruction-following benchmark. Our findings highlight the necessity of robust and faithful unlearning methods for safe deployment of language models.
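To make the core idea concrete, here is a minimal toy sketch of attribution-guided regularization. It is not the paper's implementation: the one-layer scorer, the activation-times-gradient influence proxy, and the penalty weight `lam` are all illustrative assumptions. The point it demonstrates is that penalizing strongly negative neuron attributions discourages "spurious unlearning neurons" that merely mask target knowledge, instead forcing the unlearning step to reduce the score faithfully.

```python
# Hedged sketch (NOT the paper's code): attribution-guided regularization
# on a toy one-layer scorer. The influence proxy (activation * gradient)
# and all names/values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 8))   # hidden-layer weights
w2 = rng.normal(size=16)        # output weights
x = rng.normal(size=8)          # input encoding the target ("forget") fact

h = np.maximum(W1 @ x, 0.0)     # ReLU hidden activations
score = w2 @ h                  # model's confidence in the target fact

# Neuron attribution to the score: activation * gradient (ds/dh_j = w2_j).
attr = h * w2

# Plain gradient-ascent unlearning can drive some attributions strongly
# negative -- spurious suppressor neurons that hide, rather than erase,
# the knowledge. So we penalize only the negative attributions:
lam = 0.1                                    # regularization strength (assumed)
penalty = np.sum(np.maximum(-attr, 0.0))
total_loss = score + lam * penalty

# Hand-derived gradient of total_loss w.r.t. the output weights w2:
# d(score)/dw2_j = h_j; d(penalty)/dw2_j = -h_j where attr_j < 0, else 0.
neg_mask = (attr < 0).astype(float)
grad_w2 = h - lam * h * neg_mask
w2 -= 0.5 * grad_w2                          # one regularized unlearning step
```

With `lam < 1`, the regularized step still lowers the target-fact score while damping updates that would push attributions negative; in a full model the same penalty would be applied per layer during unlearning fine-tuning.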