Model Unlearning Objectives Vary for Distinct Language Functions

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses targeted unlearning of undesirable capabilities that large language models acquire during pretraining, such as encoding dangerous knowledge and generating toxic text. It argues that unlearning objectives should differ across language functions and proposes a tailored method for each: a cosine-based, meta-learned variant of RMU for dangerous knowledge, and a multi-layer objective built on layer-specific probe directions for toxicity. Experiments on four open-source 7-8B models report strong results on both tasks, which the authors take as evidence that unlearning methods should be matched to the language function at issue. The paper frames unlearning as a family of problems rather than a single task.

📝 Abstract

Large language models (LLMs) learn undesirable properties during pretraining, including dangerous knowledge and toxic text generation. Just as post-training uses different objectives to shape different behaviors, we argue that unlearning methods should be designed for the language function at issue. To study this, we consider two mechanistically distinct unlearning goals, dangerous-knowledge unlearning and toxicity unlearning. For dangerous knowledge, we introduce a cosine-based, meta-learned variant of RMU. For toxicity, we propose a multi-layer objective based on layer-specific probe directions. Across four open-source 7-8B models, our methods achieve strong results, based on distinct training objectives for the two types of unlearning. Overall, our results suggest that unlearning should be studied as a family of problems, analogous to the multiple types of LLM post-training.
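To make the phrase "cosine-based variant of RMU" more concrete, here is a minimal sketch of what such an objective could look like. It is not the paper's implementation: the abstract does not give the loss, so everything below is an assumption. The sketch takes the commonly described RMU recipe (steer hidden activations on forget data toward a fixed random control direction, while keeping activations on retain data close to those of a frozen copy of the model) and swaps the forget term for a cosine distance. The function names, the `alpha` weight, and the plain-list activations are hypothetical, and the meta-learning component mentioned in the abstract is not modeled.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)  # epsilon guards against zero norms


def random_unit_vector(dim, seed=0):
    """Fixed random control direction (hypothetical helper)."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def forget_loss(forget_acts, control):
    """Assumed cosine forget term: mean of (1 - cos(h, control)) over tokens."""
    return sum(1.0 - cosine(h, control) for h in forget_acts) / len(forget_acts)


def retain_loss(updated_acts, frozen_acts):
    """Mean squared error between updated-model and frozen-model activations."""
    total, count = 0.0, 0
    for h_new, h_old in zip(updated_acts, frozen_acts):
        for x, y in zip(h_new, h_old):
            total += (x - y) ** 2
            count += 1
    return total / count


def unlearning_loss(forget_acts, retain_acts, frozen_retain_acts, control, alpha=1.0):
    """Combined objective; alpha is an assumed retain-term weight."""
    return forget_loss(forget_acts, control) + alpha * retain_loss(
        retain_acts, frozen_retain_acts
    )


if __name__ == "__main__":
    control = random_unit_vector(dim=4)
    forget_acts = [[0.5, -0.2, 0.1, 0.9], [1.0, 0.0, 0.0, 0.0]]  # toy activations
    retain_acts = [[0.1, 0.2, 0.3, 0.4]]
    frozen_retain_acts = [[0.1, 0.2, 0.3, 0.5]]
    print(unlearning_loss(forget_acts, retain_acts, frozen_retain_acts, control))
```

In this sketch the forget term depends only on the direction of the activations, not their norm, which is the main difference from a squared-distance forget term. In practice the activations would come from a chosen layer of the model and the loss would be minimized with an autodiff framework.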

Problem

Research questions and friction points this paper is trying to address.

model unlearning

dangerous knowledge

toxicity

language functions

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

model unlearning

language functions

dangerous knowledge