Keeping an Eye on LLM Unlearning: The Hidden Risk and Remedy

📅 2025-05-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper identifies a novel “forgetting-signal hijacking” vulnerability in fine-tuned large language models (LLMs): adversarially crafted unlearning requests can bind common benign tokens as forgetting triggers, causing unintended model responses to legitimate user inputs. To address this, we propose Scope-Aware Unlearning (SU), a method that enforces token-level scope regularization and decouples forgetting signals from semantic content—without requiring auxiliary data or architectural modifications. SU achieves spatial confinement of unlearning effects and enhanced robustness under a plug-and-play paradigm. Evaluated across multiple LLMs, SU reduces the success rate of stealthy attacks by over 70%, improves unlearning accuracy by 12.4%, and maintains original task performance within ±0.8% deviation. Moreover, SU is fully compatible with mainstream parameter-efficient fine-tuning paradigms such as LoRA.

Technology Category

Application Category

📝 Abstract
Although Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of tasks, growing concerns have emerged over the misuse of sensitive, copyrighted, or harmful data during training. To address these concerns, unlearning techniques have been developed to remove the influence of specific data without retraining from scratch. However, this paper reveals a critical vulnerability in fine-tuning-based unlearning: a malicious user can craft a manipulated forgetting request that stealthily degrades the model's utility for benign users. We demonstrate this risk through a red-teaming Stealthy Attack (SA), which is inspired by two key limitations of existing unlearning (the inability to constrain the scope of unlearning effect and the failure to distinguish benign tokens from unlearning signals). Prior work has shown that unlearned models tend to memorize forgetting data as unlearning signals, and respond with hallucinations or feigned ignorance when unlearning signals appear in the input. By subtly increasing the presence of common benign tokens in the forgetting data, SA enhances the connection between benign tokens and unlearning signals. As a result, when normal users include such tokens in their prompts, the model exhibits unlearning behaviors, leading to unintended utility degradation. To address this vulnerability, we propose Scope-aware Unlearning (SU), a lightweight enhancement that introduces a scope term into the unlearning objective, encouraging the model to localize the forgetting effect. Our method requires no additional data processing, integrates seamlessly with existing fine-tuning frameworks, and significantly improves robustness against SA. Extensive experiments validate the effectiveness of both SA and SU.
Problem

Research questions and friction points this paper is trying to address.

Exposes vulnerability in LLM unlearning to stealthy attacks
Reveals manipulated forgetting requests degrade model utility
Proposes scope-aware unlearning to localize forgetting effects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stealthy Attack exploits unlearning limitations
Scope-aware Unlearning localizes forgetting effect
Lightweight enhancement without extra data processing
🔎 Similar Papers
No similar papers found.