🤖 AI Summary
To address the challenge that large language models (LLMs) often suffer degradation in general capabilities when selectively forgetting specific texts, this paper proposes an efficient unlearning framework. Methodologically, it brings the mean teacher mechanism, a proximal optimization method previously used in semi-supervised learning, to the model unlearning task, and shows that it approximates the trajectory of a slow natural gradient descent (NGD), which inherently favors low-curvature updates that are less likely to degrade model utility. Because slow NGD can suffer from vanishing gradients, the paper further designs a negative log-unlikelihood (NLUL) loss that avoids this problem. Evaluation on the MUSE benchmark shows that combining the mean teacher with NLUL improves several metrics over strong baselines, mitigating the performance degradation induced by unlearning and offering a simple, robust recipe for controllable forgetting in LLMs.
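The mean teacher mechanism mentioned above maintains a slowly moving average of the student's parameters. A minimal sketch of that exponential-moving-average (EMA) update, using plain Python lists in place of real model tensors (the decay rate `alpha` is a hypothetical choice, not taken from the paper):

```python
def mean_teacher_update(teacher, student, alpha=0.99):
    """EMA update from Tarvainen & Valpola (2017):
    teacher <- alpha * teacher + (1 - alpha) * student.
    A small (1 - alpha) makes the teacher drift slowly toward the student,
    which is what yields the proximal, low-curvature behavior described above.
    """
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]

# Toy usage: the teacher moves only a fraction of the way toward the student.
teacher = [0.0, 0.0]
student = [1.0, 2.0]
teacher = mean_teacher_update(teacher, student, alpha=0.9)
```

With `alpha=0.9`, the teacher lands at roughly `[0.1, 0.2]`: each step is a small, damped move toward the current student, rather than a full copy.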
📝 Abstract
One of the goals of language model unlearning is to reduce memorization of selected text instances while retaining the model's general abilities. Despite various proposed methods, reducing memorization of large datasets without noticeable degradation in model utility remains challenging. In this paper, we investigate the mean teacher algorithm (Tarvainen & Valpola, 2017), a simple proximal optimization method from the continual learning literature that gradually modifies the teacher model. We show that the mean teacher can approximate a trajectory of a slow natural gradient descent (NGD), which inherently seeks low-curvature updates that are less likely to degrade the model utility. While slow NGD can suffer from vanishing gradients, we introduce a new unlearning loss called "negative log-unlikelihood" (NLUL) that avoids this problem. We show that the combination of mean teacher and NLUL improves some metrics on the MUSE benchmark (Shi et al., 2024).
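The abstract names the NLUL loss but does not spell out its formula. A common unlikelihood-style objective for suppressing memorized tokens is to minimize `-log(1 - p)` for each forget-set token probability `p`, which drives `p` toward 0 while keeping a usable gradient when `p` is not yet saturated. The sketch below assumes that form; it is an illustration, not the paper's exact definition:

```python
import math

def nlul_loss(token_probs):
    """Assumed unlikelihood-style loss for tokens to be forgotten:
    average of -log(1 - p) over the forget-set token probabilities.
    Minimizing it pushes each p toward 0 (i.e., un-memorizes the token).
    """
    eps = 1e-8  # numerical floor so p ~ 1.0 does not produce log(0)
    return sum(-math.log(max(1.0 - p, eps)) for p in token_probs) / len(token_probs)

# A strongly memorized token (high p) incurs a large penalty;
# an already-forgotten token (low p) contributes almost nothing.
high = nlul_loss([0.9])   # large
low = nlul_loss([0.01])   # near zero
```

Note the contrast with plain gradient ascent on the log-likelihood `-log p`, whose signal shrinks as `p` falls; under this assumed form, the penalty instead concentrates on tokens that are still memorized.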