Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning

📅 2026-05-15
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing unlearning methods for large language models fall short in two ways: deletion-based methods suppress specific token sequences instead of fully removing the knowledge, which biases the deletion, while refusal-based methods leave the underlying knowledge intact, so harmful content can re-emerge at inference. This work proposes the Distinguishable Deletion (D²) paradigm, which unifies knowledge erasure and refusal. D² restricts the response distribution in the latent representation, separates the knowledge to be unlearned from retained knowledge via an energy boundary, and introduces an energy index that quantifies both the presence of knowledge and that separation. Energy-based Unlearning Alignment (EUA) enforces the energy boundary during training and applies an energy-based refusal mechanism at inference, so that knowledge is erased and unlearned inputs still receive safe, coherent responses. The reported experiments show EUA significantly outperforming existing approaches while maintaining overall model performance and safety.
šŸ“ Abstract
Mitigating sensitive and harmful outputs is fundamental to ensuring safe deployment of LLMs. Existing approaches typically follow two paradigms: Knowledge Deletion (KD), which erases undesirable information during training, and Distinguishable Refusal (DR), which steers models away from using sensitive knowledge during inference. Despite rapid progress, KD-based unlearning struggles with biased deletion due to suppressing specific token sequences as a substitute for complete knowledge removal, whereas DR-based unlearning risks the re-emergence of harmful knowledge because the underlying knowledge remains intact. To address these issues, we propose Distinguishable Deletion ($\mathrm{D^2}$), a paradigm that restricts the response distribution in the latent representation rather than specific tokens to erase undesirable knowledge, while distinguishing it from retained knowledge, enabling a refusal mechanism to handle unlearned inputs safely and coherently. To implement $\mathrm{D^2}$, we introduce an energy index that quantifies the presence of knowledge and the separation between unlearned and retained content. Mathematical and empirical analyses show that energy is both accurate and efficient, enabling Energy-based Unlearning Alignment (EUA) to enforce energy-boundary unlearning during training and apply an energy-based refusal mechanism at inference. Extensive experiments demonstrate that EUA significantly outperforms previous methods, indicating the superiority of $\mathrm{D^2}$. Our code is available at https://github.com/Puning97/EUA-for-LLM-Unlearning.
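The abstract does not give the formula for the energy index, so the following is only a minimal sketch of how an energy-based refusal check could look. It assumes the free-energy score commonly used in energy-based models, $E = -T \log \sum_i \exp(z_i / T)$ over a logit vector $z$, and assumes that high energy marks unlearned content. The function names, the threshold value, and the refusal string are hypothetical and are not taken from the EUA code.

```python
import math

def free_energy(logits, temperature=1.0):
    """Free energy -T * logsumexp(logits / T) of one logit vector (assumed score)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    lse = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return -temperature * lse

def mean_energy(token_logits, temperature=1.0):
    """Average the per-token free energy over a sequence of logit vectors."""
    energies = [free_energy(row, temperature) for row in token_logits]
    return sum(energies) / len(energies)

def respond(token_logits, answer, threshold, refusal="I can't help with that."):
    """Hypothetical refusal rule: refuse when the mean energy exceeds the threshold."""
    if mean_energy(token_logits) > threshold:
        return refusal
    return answer

# Toy logits: peaked rows give low energy, flat rows give high energy.
peaked = [[9.0, 0.0, 0.0], [8.0, 1.0, 0.0]]
flat = [[0.1, 0.0, 0.1], [0.0, 0.1, 0.0]]

print(respond(peaked, "retained answer", threshold=-4.0))
print(respond(flat, "retained answer", threshold=-4.0))
```

With these toy values the peaked logits fall below the threshold and the answer is returned, while the flat logits fall above it and trigger the refusal. In practice the threshold would have to be calibrated on held-out retained and unlearned data.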
Problem

Research questions and friction points this paper is trying to address.

Knowledge Erasure
Refusal Mechanism
LLM Unlearning
Sensitive Knowledge
Harmful Outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distinguishable Deletion
Energy-based Unlearning Alignment
Knowledge Erasure
Refusal Mechanism
Latent Representation
🔎 Similar Papers
2024-06-22, International Conference on Computational Linguistics, Citations: 4
2024-05-26, arXiv.org, Citations: 5