🤖 AI Summary
Large language models (LLMs) can memorize sensitive data during training, and existing unlearning methods often suffer from spurious neuron masking: target knowledge is hidden rather than erased, so it can be reacquired during subsequent training and the privacy risk persists. This paper proposes Ssiuu, an attribution-guided unlearning method that models neuron influence and applies targeted regularization to suppress spurious negative attribution paths, enabling genuine, durable erasure of sensitive information. Unlike shallow alignment-based strategies, Ssiuu intervenes at the attribution level to disrupt the mechanism by which unwanted knowledge recurs in the model's representations. Experiments show that Ssiuu outperforms strong baselines under both adversarial injection of private data and retraining on an instruction-following benchmark, substantially mitigating unlearning reversal, the phenomenon in which erased knowledge resurfaces after further training.
📝 Abstract
Large language models trained on web-scale data can memorize private or sensitive knowledge, raising significant privacy risks. Although some unlearning methods mitigate these risks, they remain vulnerable to "relearning" during subsequent training, allowing a substantial portion of forgotten knowledge to resurface. In this paper, we show that widely used unlearning methods cause shallow alignment: instead of faithfully erasing target knowledge, they generate spurious unlearning neurons that amplify negative influence to hide it. To overcome this limitation, we introduce Ssiuu, a new class of unlearning methods that employs attribution-guided regularization to prevent spurious negative influence and faithfully remove target knowledge. Experimental results confirm that our method reliably erases target knowledge and outperforms strong baselines across two practical retraining scenarios: (1) adversarial injection of private data, and (2) benign attack using an instruction-following benchmark. Our findings highlight the necessity of robust and faithful unlearning methods for safe deployment of language models.
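To make the core idea concrete, here is a minimal toy sketch of attribution-guided regularization. It is not the paper's implementation: the one-layer scorer, the activation-times-gradient influence proxy, and the penalty weight `lam` are all illustrative assumptions. The point it demonstrates is that penalizing strongly negative neuron attributions discourages "spurious unlearning neurons" that merely mask target knowledge, instead forcing the unlearning step to reduce the score faithfully.

```python
# Hedged sketch (NOT the paper's code): attribution-guided regularization
# on a toy one-layer scorer. The influence proxy (activation * gradient)
# and all names/values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 8))   # hidden-layer weights
w2 = rng.normal(size=16)        # output weights
x = rng.normal(size=8)          # input encoding the target ("forget") fact

h = np.maximum(W1 @ x, 0.0)     # ReLU hidden activations
score = w2 @ h                  # model's confidence in the target fact

# Neuron attribution to the score: activation * gradient (ds/dh_j = w2_j).
attr = h * w2

# Plain gradient-ascent unlearning can drive some attributions strongly
# negative -- spurious suppressor neurons that hide, rather than erase,
# the knowledge. So we penalize only the negative attributions:
lam = 0.1                                    # regularization strength (assumed)
penalty = np.sum(np.maximum(-attr, 0.0))
total_loss = score + lam * penalty

# Hand-derived gradient of total_loss w.r.t. the output weights w2:
# d(score)/dw2_j = h_j; d(penalty)/dw2_j = -h_j where attr_j < 0, else 0.
neg_mask = (attr < 0).astype(float)
grad_w2 = h - lam * h * neg_mask
w2 -= 0.5 * grad_w2                          # one regularized unlearning step
```

With `lam < 1`, the regularized step still lowers the target-fact score while damping updates that would push attributions negative; in a full model the same penalty would be applied per layer during unlearning fine-tuning.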