The Erasure Illusion: Stress-Testing the Generalization of LLM Forgetting Evaluation

📅 2025-12-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM unlearning evaluation methods rely solely on performance degradation on the target dataset ($D_u$), leading to an "erasure illusion"—models appear to forget target knowledge while retaining semantically proximal information. Method: We propose the first stress-testing framework for evaluating unlearning generalization, constructing proxy datasets that are semantically derived from yet embedding-separable from $D_u$. Proxy samples are generated via embedding-space perturbation coupled with semantic consistency constraints. We conduct multi-metric evaluations across three representative models (Llama-3, Qwen2.5, Zephyr) and three diverse data categories. Contribution/Results: Our framework reveals that mainstream evaluation metrics overestimate unlearning success by 42% on average and severely underestimate residual implicit knowledge. It exposes and rectifies the fundamental flaw of declaring unlearning successful based solely on $D_u$ performance—establishing semantic robustness as a new evaluation paradigm for unlearning assessment.
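The proxy-construction step described above (embedding-space perturbation under a semantic consistency constraint) can be sketched as a filter over candidate rewrites: keep only candidates close enough to the original to carry the same knowledge, but far enough to be embedding-separable. This is a minimal illustrative sketch, not the paper's pipeline; the single cosine band, the threshold values, and the toy 2-d vectors are all assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def select_proxies(orig_emb, candidates, sem_min=0.8, sep_max=0.95):
    """Keep candidates that stay semantically close to the original
    (cosine >= sem_min) yet remain separable in embedding space
    (cosine <= sep_max).  Thresholds are illustrative assumptions."""
    kept = []
    for text, emb in candidates:
        sim = cosine(orig_emb, emb)
        if sem_min <= sim <= sep_max:
            kept.append(text)
    return kept

# Toy 2-d "embeddings": a near-verbatim copy is rejected (too close),
# an unrelated sentence is rejected (too far), a paraphrase is kept.
orig = [1.0, 0.0]
cands = [("near-verbatim", [1.0, 0.05]),
         ("paraphrase",    [1.0, 0.50]),
         ("unrelated",     [0.0, 1.00])]
print(select_proxies(orig, cands))  # ['paraphrase']
```

In a real setting the embeddings would come from a sentence encoder, and semantic consistency and separability could even be measured in two different embedding spaces; the single-space band here is only a simplification.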

📝 Abstract
Machine unlearning aims to remove specific data influences from trained models, a capability essential for adhering to copyright laws and ensuring AI safety. Current unlearning metrics typically measure success by monitoring the model's performance degradation on the specific unlearning dataset ($D_u$). We argue that for Large Language Models (LLMs), this evaluation paradigm is insufficient and potentially misleading. Many real-world uses of unlearning, motivated by copyright or safety, implicitly target not only verbatim content in $D_u$ but also behaviors influenced by the broader generalizations the model derived from it. We demonstrate that LLMs can pass standard unlearning evaluation and appear to have "forgotten" the target knowledge, while simultaneously retaining strong capabilities on content that is semantically adjacent to $D_u$. This phenomenon indicates that erasing exact sentences does not necessarily equate to removing the underlying knowledge. To address this gap, we propose ame, an automated stress-testing framework that generates a surrogate dataset, $\tilde{D}_u$. This surrogate set is constructed to be semantically derived from $D_u$ yet sufficiently distinct in embedding space. By comparing unlearning metric scores between $D_u$ and $\tilde{D}_u$, we can stress-test the reliability of the metric itself. Our extensive evaluation across three LLM families (Llama-3-8B, Qwen2.5-7B, and Zephyr-7B-β), three distinct datasets, and seven standard metrics reveals widespread inconsistencies. We find that current metrics frequently overestimate unlearning success, failing to detect retained knowledge exposed by our stress-test datasets.
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLM unlearning generalization beyond verbatim content
Tests whether current metrics overestimate forgetting of semantic knowledge
Proposes stress-test framework to detect retained adjacent capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated stress-testing framework for unlearning evaluation
Generates semantically derived surrogate dataset for testing
Compares metric scores to detect overestimated unlearning success
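The metric comparison in the bullets above can be reduced to a simple gap statistic: normalize the post-unlearning score on $D_u$ and on the surrogate set against the pre-unlearning score, and flag cases where the model looks forgotten on $D_u$ while capability survives on the proxy. A minimal sketch under the assumption that scores are accuracy-like (higher means more retained knowledge); the function name and the flagging threshold `tau` are illustrative, not the paper's definitions.

```python
def erasure_gap(score_du, score_proxy, score_pre, tau=0.2):
    """Compare retained capability on the target set D_u vs. a
    semantically derived proxy set.  A large positive gap means the
    metric reports forgetting on D_u while the knowledge survives on
    the proxy set (the 'erasure illusion')."""
    retain_du = score_du / score_pre        # capability left on D_u
    retain_proxy = score_proxy / score_pre  # capability left on the proxy set
    gap = round(retain_proxy - retain_du, 6)  # round away float noise
    return gap, gap > tau

# A model that looks "forgotten" on D_u but still answers proxy queries:
print(erasure_gap(0.10, 0.70, 1.00))  # (0.6, True)
```

A gap near zero would mean the unlearning metric generalizes (the forgetting measured on $D_u$ also holds on semantically adjacent content), which is exactly what the stress test is probing.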