Forgetting to Forget: Attention Sink as A Gateway for Backdooring LLM Unlearning

📅 2025-10-19

📈 Citations: 0

✨ Influential: 0

career value

185K/year

🤖 AI Summary

This paper identifies a novel backdoor threat to large language model (LLM) unlearning mechanisms: models exhibit normal forgetting behavior under standard evaluation but selectively recover erased knowledge upon activation of a specific trigger. Method: For open-weight LLMs, we propose and empirically validate the first backdoor unlearning attack leveraging the “attention settling” phenomenon—strategically embedding triggers within naturally low-activation regions of the attention mechanism to enhance stealth and robustness. Our approach integrates trigger localization with conventional backdoor training. Contribution/Results: We achieve targeted, high-fidelity knowledge recovery across multiple open-weight models. Experiments show no statistically significant behavioral deviation from standard unlearned models in the absence of the trigger (p > 0.05), while triggered recovery accuracy exceeds 92%. This is the first demonstration that the unlearning process itself can be maliciously manipulated, providing critical security insights and motivating new defensive paradigms for trustworthy LLM unlearning.

Technology Category

Application Category

📝 Abstract

Large language model (LLM) unlearning has become a critical mechanism for removing undesired data, knowledge, or behaviors from pre-trained models while retaining their general utility. Yet, with the rise of open-weight LLMs, we ask: can the unlearning process itself be backdoored, appearing successful under normal conditions yet reverting to pre-unlearned behavior when a hidden trigger is activated? Drawing inspiration from classical backdoor attacks that embed triggers into training data to enforce specific behaviors, we investigate backdoor unlearning, where models forget as intended in the clean setting but recover forgotten knowledge when the trigger appears. We show that designing such attacks presents unique challenges, hinging on where triggers are placed and how backdoor training is reinforced. We uncover a strong link between backdoor efficacy and the attention sink phenomenon, i.e., shallow input tokens consistently attract disproportionate attention in LLMs. Our analysis reveals that these attention sinks serve as gateways for backdoor unlearning: placing triggers at sink positions and aligning their attention values markedly enhances backdoor persistence. Extensive experiments validate these findings, showing that attention-sink-guided backdoor unlearning reliably restores forgotten knowledge in the presence of backdoor triggers, while behaving indistinguishably from a normally unlearned model when triggers are absent. Code is available at https://github.com/OPTML-Group/Unlearn-Backdoor.

Problem

Research questions and friction points this paper is trying to address.

Backdooring LLM unlearning to revert forgotten knowledge via triggers

Investigating attention sink phenomenon as gateway for backdoor attacks

Ensuring backdoors persist only when specific hidden triggers activate

Innovation

Methods, ideas, or system contributions that make the work stand out.

Backdoor unlearning manipulates attention sink positions

Triggers at sink positions enhance backdoor persistence

Models revert to forgotten knowledge with hidden triggers

🔎 Similar Papers

Unlearning or Obfuscating? Jogging the Memory of Unlearned LLMs via Benign Relearning