🤖 AI Summary
This work identifies the "knowledge hole" problem in large language model (LLM) unlearning: while existing unlearning methods can erase targeted unwanted knowledge without degrading performance on standard benchmarks, they often inadvertently impair semantically adjacent benign knowledge, causing latent degradation that those benchmarks never surface. To probe for such holes, the authors propose a test case generation framework that automatically produces queries covering both the immediate semantic neighborhood of the unlearned content and broader areas of potential failure, overcoming the limitations of conventional, static evaluation. Experiments across multiple LLMs and unlearning algorithms show that up to 98.7% of the generated test cases elicit irrelevant or nonsensical responses from unlearned models, even though the pretrained model answers them, while standard benchmarks fail to detect these failures. This work shifts the unlearning evaluation paradigm from isolated assessment of the target knowledge to holistic integrity evaluation of associated knowledge, establishing a methodological foundation for trustworthy model editing.
📝 Abstract
Machine unlearning has emerged as a prevalent technical solution for selectively removing unwanted knowledge absorbed during pre-training, without requiring full retraining. While recent unlearning techniques can effectively remove undesirable content without severely compromising performance on standard benchmarks, we find that they may inadvertently create "knowledge holes" -- unintended losses of benign knowledge that standard benchmarks fail to capture. To probe where unlearned models reveal knowledge holes, we propose a test case generation framework that explores both immediate neighbors of unlearned content and broader areas of potential failures. Our evaluation demonstrates significant hidden costs of unlearning: up to 98.7% of the test cases yield irrelevant or nonsensical responses from unlearned models, despite being answerable by the pretrained model. These findings necessitate rethinking the conventional approach to evaluating knowledge preservation in unlearning, moving beyond standard, static benchmarks.
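The probing idea described above can be illustrated with a toy sketch. This is not the paper's implementation: all names are hypothetical, the "models" are stub lookup tables standing in for real LLM calls, and semantic neighborhood is approximated with word overlap where a real framework would use embeddings or an LLM-based generator.

```python
def jaccard(a, b):
    """Word-overlap similarity between two short texts (crude semantic proxy)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def neighborhood_probes(target, candidates, threshold=0.3):
    """Select candidate questions near the unlearned target.
    Hypothetical helper: real systems would generate probes, not filter a list."""
    return [q for q in candidates if jaccard(target, q) >= threshold]

def degradation_rate(probes, base_model, unlearned_model, sim_floor=0.5):
    """Fraction of probes where the unlearned model's answer diverges from
    the base model's answer -- a stand-in signal for a 'knowledge hole'."""
    degraded = sum(
        1 for q in probes
        if jaccard(base_model(q), unlearned_model(q)) < sim_floor
    )
    return degraded / len(probes) if probes else 0.0

# Toy stand-ins for a pretrained model and its unlearned counterpart.
base = {
    "Who wrote Harry Potter?": "J. K. Rowling wrote Harry Potter",
    "Who published Harry Potter?": "Bloomsbury published Harry Potter",
}
unlearned = {
    "Who wrote Harry Potter?": "I cannot answer that",
    "Who published Harry Potter?": "xyzzy",  # nonsensical: a knowledge hole
}

target = "Who wrote Harry Potter?"
probes = neighborhood_probes(target, list(base))
rate = degradation_rate(probes, base.get, unlearned.get)
print(f"degradation on {len(probes)} neighborhood probes: {rate:.1%}")
```

Here the publisher question is a benign neighbor of the unlearned fact; the toy "unlearned model" fails it too, which is exactly the hidden cost the abstract reports standard benchmarks missing.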