KUDA: Knowledge Unlearning by Deviating Representation for Large Language Models

📅 2026-02-22
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of removing sensitive, harmful, or copyrighted knowledge from large language models without degrading overall performance. Existing unlearning methods struggle to precisely excise specific knowledge while preserving model utility. To overcome this, the authors propose a knowledge-level unlearning mechanism grounded in causal tracing and representation shifting. The approach first identifies critical network layers storing the target knowledge via causal analysis, then formulates a representation shift objective to decouple model outputs from the original knowledge associations. Additionally, a relaxed null-space projection is introduced to mitigate optimization conflicts between forgetting and retaining tasks. Evaluated on the WMDP and MUSE benchmarks, the method significantly outperforms current techniques, effectively erasing targeted knowledge while maintaining the model's general capabilities and output coherence.
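The summary does not spell out the exact form of the representation shift objective, but the core idea can be illustrated with a minimal numpy sketch. Here we assume (purely for illustration; this is not KUDA's actual loss or API) that the objective reduces the cosine alignment between a hidden representation for a forget-set input and its frozen original position, pushing the representation away from where the target knowledge was encoded:

```python
import numpy as np

# Toy gradient descent on cosine alignment: minimizing cos(h, h_orig)
# drives the representation h away from its original position.
rng = np.random.default_rng(0)
h_orig = rng.standard_normal(16)
h_orig /= np.linalg.norm(h_orig)      # frozen original representation

h = h_orig + 0.1 * rng.standard_normal(16)  # start near the original
h /= np.linalg.norm(h)

lr = 0.3
for _ in range(100):
    cos = h @ h_orig
    grad = h_orig - cos * h           # tangential gradient of cos(h, h_orig)
    h -= lr * grad                    # descend: reduce alignment
    h /= np.linalg.norm(h)            # stay on the unit sphere

print(f"final alignment: {h @ h_orig:.3f}")  # drops from ~1 toward -1
```

In the actual method the update would be applied to the weights of the layers located by causal tracing, not to the representation directly; the sketch only shows the direction of the deviation objective.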

πŸ“ Abstract
Large language models (LLMs) acquire a large amount of knowledge through pre-training on vast and diverse corpora. While this endows LLMs with strong capabilities in generation and reasoning, it amplifies risks associated with sensitive, copyrighted, or harmful content in training data. LLM unlearning, which aims to remove specific knowledge encoded within models, is a promising technique for reducing these risks. However, existing LLM unlearning methods often force LLMs to generate random or incoherent answers because they cannot precisely alter the encoded knowledge. To achieve effective unlearning at the knowledge level of LLMs, we propose Knowledge Unlearning by Deviating representAtion (KUDA). We first use causal tracing to locate the specific layers that store the target knowledge. We then design a new unlearning objective that induces the model's representations to deviate from their original positions during knowledge removal, disrupting the model's ability to associate with the target knowledge. To resolve the optimization conflict between forgetting and retention, we employ a relaxed null-space projection mechanism that mitigates disruption to the representation space of retained knowledge. Extensive experiments on representative benchmarks, WMDP and MUSE, demonstrate that KUDA outperforms most existing baselines by effectively balancing knowledge removal and model utility retention.
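The abstract's relaxed null-space projection can be sketched as follows. This is a hedged illustration, not KUDA's implementation: we assume the mechanism projects the forgetting update for a weight matrix onto the approximate null space of retain-set activations, so that outputs on retained inputs are nearly unchanged, with a regularizer `lam` acting as the relaxation. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 32, 8                       # feature dim, number of retain activations
X = rng.standard_normal((n, d))    # rows: retain-set input activations
dW = rng.standard_normal((4, d))   # raw forgetting update for W (where y = W x)

# Relaxed null-space projector: X @ P ~= 0, with lam > 0 softening the
# exact projection (the "relaxation" easing forget/retain conflicts).
lam = 1e-3
P = np.eye(d) - X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), X)
dW_proj = dW @ P

# The projected update barely perturbs outputs on retain activations.
ratio = np.linalg.norm(X @ dW_proj.T) / np.linalg.norm(X @ dW.T)
print(f"retain-output perturbation ratio: {ratio:.2e}")
```

Setting `lam = 0` (with full-row-rank `X`) gives an exact null-space projection, fully protecting the retained directions but leaving less room for the forgetting update; a small positive `lam` trades a tiny retain-side perturbation for easier optimization.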
Problem

Research questions and friction points this paper is trying to address.

LLM unlearning
knowledge removal
representation deviation
model utility retention
sensitive content
Innovation

Methods, ideas, or system contributions that make the work stand out.

knowledge unlearning
representation deviation
causal tracing
null-space projection
large language models