Not All Tokens Are Meant to Be Forgotten

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
During pretraining on massive corpora, LLMs often memorize privacy-sensitive or copyrighted content, posing compliance risks. Existing unlearning methods suffer from over-forgetting: they indiscriminately suppress the generation of all tokens in the forget samples, erasing irrelevant information and severely degrading model utility. To address this, the authors propose Targeted Information Forgetting (TIF), a framework that disentangles unwanted words (e.g., PII or copyright-infringing terms) from general words within forget samples. TIF uses a flexible targeted information identifier to locate unwanted tokens, then applies a Targeted Preference Optimization approach: a Logit Preference Loss suppresses the generation of unwanted words while a Preservation Loss retains the general semantic knowledge carried by the remaining words. Evaluated on the TOFU and MUSE benchmarks, TIF achieves state-of-the-art results, striking a strong trade-off between forgetting effectiveness and model utility.

📝 Abstract
Large Language Models (LLMs), pre-trained on massive text corpora, exhibit remarkable human-level language understanding, reasoning, and decision-making abilities. However, they tend to memorize unwanted information, such as private or copyrighted content, raising significant privacy and legal concerns. Unlearning has emerged as a promising solution, but existing methods face a significant challenge of over-forgetting. This issue arises because they indiscriminately suppress the generation of all the tokens in forget samples, leading to a substantial loss of model utility. To overcome this challenge, we introduce the Targeted Information Forgetting (TIF) framework, which consists of (1) a flexible targeted information identifier designed to differentiate between unwanted words (UW) and general words (GW) in the forget samples, and (2) a novel Targeted Preference Optimization approach that leverages Logit Preference Loss to unlearn unwanted information associated with UW and Preservation Loss to retain general information in GW, effectively improving the unlearning process while mitigating utility degradation. Extensive experiments on the TOFU and MUSE benchmarks demonstrate that the proposed TIF framework enhances unlearning effectiveness while preserving model utility and achieving state-of-the-art results.
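The two-part objective described in the abstract can be sketched as a token-masked loss: unwanted-word (UW) positions get a term that pushes the target token's probability down, while general-word (GW) positions keep a standard cross-entropy term. This is an illustrative reconstruction under stated assumptions, not the paper's exact formulation; the `-log(1 - p)` forgetting term, the `lam` weighting, and all function names here are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tif_loss(logits, labels, uw_mask, lam=1.0):
    """Illustrative TIF-style objective (assumed form, not the paper's).

    logits:  (T, V) next-token logits for a forget sample
    labels:  (T,)   target token ids
    uw_mask: (T,)   True where the target token is an unwanted word
    lam:     weight on the forgetting term (assumed hyperparameter)
    """
    probs = softmax(logits)
    p_target = probs[np.arange(len(labels)), labels]
    eps = 1e-9
    # Forgetting term on UW positions: penalize probability mass on the
    # unwanted target token (a simple stand-in for the Logit Preference Loss).
    forget = -np.log(np.clip(1.0 - p_target, eps, 1.0))
    # Preservation term on GW positions: ordinary cross-entropy, so general
    # knowledge in the sample is retained.
    keep = -np.log(np.clip(p_target, eps, 1.0))
    per_token = np.where(uw_mask, lam * forget, keep)
    return per_token.mean()
```

With this shape, gradient descent lowers the model's probability of emitting UW tokens while leaving GW predictions trained as usual, which mirrors the over-forgetting mitigation the abstract describes.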
Problem

Research questions and friction points this paper is trying to address.

LLMs memorize unwanted private or copyrighted content
Existing unlearning methods cause excessive utility loss
Need to differentiate and selectively forget specific tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Targeted Information Forgetting framework
Differentiates unwanted and general words
Optimizes unlearning with preference losses