LLM Unlearning with LLM Beliefs

📅 2025-10-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) often memorize and regurgitate sensitive or harmful content. Existing unlearning methods frequently induce a "squeezing effect," in which probability mass shifts toward semantically similar, high-likelihood paraphrases, resulting in illusory forgetting and misleading evaluation metrics. This work formally characterizes the squeezing effect for the first time and proposes a bootstrapping (BS) unlearning framework: by jointly suppressing both the target responses and the model's own high-confidence generations (its "beliefs"), the authors design two algorithms, BS-T (token-level) and BS-S (sequence-level). Extensive experiments across multiple LLMs and benchmarks demonstrate that the approach significantly reduces regeneration of target content while preserving overall model utility, achieving more thorough, verifiable forgetting and advancing trustworthy model editing with a principled paradigm.

📝 Abstract
Large language models trained on vast corpora inherently risk memorizing sensitive or harmful content, which may later resurface in their outputs. Prevailing unlearning methods generally rely on gradient ascent and its variants to lower the probability of specific target responses. However, we find that this strategy induces a critical side effect: probability mass is redistributed into high-likelihood regions, often corresponding to semantically related rephrasings of the targets. We refer to this as the squeezing effect, which explains why many methods yield merely spurious unlearning, a problem further obscured by automated metrics (e.g., ROUGE, truth ratio) that misreport actual success. To address this, we propose a bootstrapping (BS) framework that explicitly links the squeezing effect with the model's own high-confidence generations, namely its model beliefs. Since model beliefs inherently capture the very high-likelihood regions where probability mass is squeezed, incorporating them into the unlearning objective directly counters the squeezing effect. By jointly suppressing both target responses and model beliefs, BS-T (token) attenuates high-probability tokens, whereas BS-S (sequence) removes entire high-confidence generations, together achieving more thorough forgetting while preserving utility. Extensive experiments across diverse benchmarks with various model families confirm the effectiveness of our approach.
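The abstract describes jointly suppressing the forget targets and the model's own top-probability tokens so that probability mass cannot simply "squeeze" into near-paraphrase regions. A minimal numpy sketch of a BS-T-style token-level objective is below; the function name, the top-k belief selection, and the `alpha` weighting are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def bs_t_loss(logits, target_ids, k=3, alpha=1.0):
    """Hypothetical token-level belief-suppression (BS-T style) objective.

    logits:     (T, V) per-position vocabulary logits on the forget sample.
    target_ids: (T,)   token ids of the target response to unlearn.
    Minimizing this loss drives down both the target tokens' log-probs
    (gradient-ascent-style forgetting) and the log-probs of the model's
    current top-k tokens per position (its "beliefs"), countering the
    squeezing effect described in the abstract.
    """
    probs = softmax(logits)                                    # (T, V)
    T = len(target_ids)
    # Forget term: log-prob of the target response (pushed down).
    forget = np.log(probs[np.arange(T), target_ids] + 1e-12).mean()
    # Belief term: log-prob of the model's own top-k tokens (pushed down).
    topk = np.argsort(probs, axis=-1)[:, -k:]                  # (T, k)
    belief = np.log(probs[np.arange(T)[:, None], topk] + 1e-12).mean()
    return forget + alpha * belief
```

In this sketch the sequence-level variant (BS-S) would instead sample whole high-confidence generations from the model and apply the suppression term to those sequences rather than to per-position top-k tokens.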
Problem

Research questions and friction points this paper is trying to address.

Preventing harmful content memorization in large language models
Addressing semantic rephrasing leakage during unlearning processes
Correcting misleading automated metrics for true unlearning evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bootstrapping framework counters squeezing effect
Suppresses target responses and model beliefs
Achieves thorough forgetting while preserving utility
Kemou Li
State Key Laboratory of Internet of Things and Smart City, University of Macau
Qizhou Wang
PhD @ HKBU; machine learning
Yue Wang
TMLR Group, Department of Computer Science, Hong Kong Baptist University
Fengpeng Li
State Key Laboratory of Internet of Things and Smart City, University of Macau
Jun Liu
State Key Laboratory of Internet of Things and Smart City, University of Macau
Bo Han
TMLR Group, Department of Computer Science, Hong Kong Baptist University
Jiantao Zhou
Professor, Department of Computer and Information Science, University of Macau; Information Forensics and Security, Multimedia Signal Processing, Machine Learning