🤖 AI Summary
This work identifies the "knowledge hole" problem in large language model (LLM) unlearning: while existing unlearning methods can erase targeted unwanted knowledge without degrading performance on standard benchmarks, they often inadvertently impair semantically adjacent benign knowledge, causing latent degradation that those benchmarks never surface. To probe for such holes, the authors propose a test case generation framework that automatically produces queries covering both the immediate semantic neighborhood of the unlearned content and broader areas of potential failure, overcoming the limitations of conventional, static evaluation. Experiments across multiple LLMs and unlearning algorithms show that up to 98.7% of the generated test cases elicit irrelevant or nonsensical responses from unlearned models, even though the pretrained model answers them, while standard benchmarks fail to detect these failures. This work shifts the unlearning evaluation paradigm from isolated assessment of the target knowledge to holistic integrity evaluation of associated knowledge, establishing a methodological foundation for trustworthy model editing.
📝 Abstract
Machine unlearning has emerged as a prevalent technical solution for selectively removing unwanted knowledge absorbed during pre-training, without requiring full retraining. While recent unlearning techniques can effectively remove undesirable content without severely compromising performance on standard benchmarks, we find that they may inadvertently create "knowledge holes" -- unintended losses of benign knowledge that standard benchmarks fail to capture. To probe where unlearned models reveal knowledge holes, we propose a test case generation framework that explores both immediate neighbors of unlearned content and broader areas of potential failures. Our evaluation demonstrates significant hidden costs of unlearning: up to 98.7% of the test cases yield irrelevant or nonsensical responses from unlearned models, despite being answerable by the pretrained model. These findings necessitate rethinking the conventional approach to evaluating knowledge preservation in unlearning, moving beyond standard, static benchmarks.
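The probing idea described above can be illustrated with a toy sketch. This is not the paper's implementation: all names are hypothetical, the "models" are stub lookup tables standing in for real LLM calls, and semantic neighborhood is approximated with word overlap where a real framework would use embeddings or an LLM-based generator.

```python
def jaccard(a, b):
    """Word-overlap similarity between two short texts (crude semantic proxy)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def neighborhood_probes(target, candidates, threshold=0.3):
    """Select candidate questions near the unlearned target.
    Hypothetical helper: real systems would generate probes, not filter a list."""
    return [q for q in candidates if jaccard(target, q) >= threshold]

def degradation_rate(probes, base_model, unlearned_model, sim_floor=0.5):
    """Fraction of probes where the unlearned model's answer diverges from
    the base model's answer -- a stand-in signal for a 'knowledge hole'."""
    degraded = sum(
        1 for q in probes
        if jaccard(base_model(q), unlearned_model(q)) < sim_floor
    )
    return degraded / len(probes) if probes else 0.0

# Toy stand-ins for a pretrained model and its unlearned counterpart.
base = {
    "Who wrote Harry Potter?": "J. K. Rowling wrote Harry Potter",
    "Who published Harry Potter?": "Bloomsbury published Harry Potter",
}
unlearned = {
    "Who wrote Harry Potter?": "I cannot answer that",
    "Who published Harry Potter?": "xyzzy",  # nonsensical: a knowledge hole
}

target = "Who wrote Harry Potter?"
probes = neighborhood_probes(target, list(base))
rate = degradation_rate(probes, base.get, unlearned.get)
print(f"degradation on {len(probes)} neighborhood probes: {rate:.1%}")
```

Here the publisher question is a benign neighbor of the unlearned fact; the toy "unlearned model" fails it too, which is exactly the hidden cost the abstract reports standard benchmarks missing.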