🤖 AI Summary
To address the challenge that large language models (LLMs) often suffer degradation in general capabilities when selectively forgetting specific texts, this paper proposes an efficient unlearning framework. Methodologically, it brings the mean teacher mechanism, a proximal optimization method previously used in semi-supervised learning, to the model unlearning task, and shows that it approximates the trajectory of a slow natural gradient descent (NGD), which inherently favors low-curvature updates that are less likely to degrade model utility. Because slow NGD can suffer from vanishing gradients, the paper further designs a negative log-unlikelihood (NLUL) loss that avoids this problem. Evaluation on the MUSE benchmark shows that combining the mean teacher with NLUL improves several metrics over strong baselines, mitigating the performance degradation induced by unlearning and offering a simple, robust recipe for controllable forgetting in LLMs.
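The mean teacher mechanism mentioned above maintains a slowly moving average of the student's parameters. A minimal sketch of that exponential-moving-average (EMA) update, using plain Python lists in place of real model tensors (the decay rate `alpha` is a hypothetical choice, not taken from the paper):

```python
def mean_teacher_update(teacher, student, alpha=0.99):
    """EMA update from Tarvainen & Valpola (2017):
    teacher <- alpha * teacher + (1 - alpha) * student.
    A small (1 - alpha) makes the teacher drift slowly toward the student,
    which is what yields the proximal, low-curvature behavior described above.
    """
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]

# Toy usage: the teacher moves only a fraction of the way toward the student.
teacher = [0.0, 0.0]
student = [1.0, 2.0]
teacher = mean_teacher_update(teacher, student, alpha=0.9)
```

With `alpha=0.9`, the teacher lands at roughly `[0.1, 0.2]`: each step is a small, damped move toward the current student, rather than a full copy.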
📝 Abstract
One of the goals of language model unlearning is to reduce memorization of selected text instances while retaining the model's general abilities. Despite various proposed methods, reducing memorization of large datasets without noticeable degradation in model utility remains challenging. In this paper, we investigate the mean teacher algorithm (Tarvainen & Valpola, 2017), a simple proximal optimization method from the continual learning literature that gradually modifies the teacher model. We show that the mean teacher can approximate a trajectory of a slow natural gradient descent (NGD), which inherently seeks low-curvature updates that are less likely to degrade the model utility. While slow NGD can suffer from vanishing gradients, we introduce a new unlearning loss called "negative log-unlikelihood" (NLUL) that avoids this problem. We show that the combination of mean teacher and NLUL improves some metrics on the MUSE benchmark (Shi et al., 2024).
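The abstract names the NLUL loss but does not spell out its formula. A common unlikelihood-style objective for suppressing memorized tokens is to minimize `-log(1 - p)` for each forget-set token probability `p`, which drives `p` toward 0 while keeping a usable gradient when `p` is not yet saturated. The sketch below assumes that form; it is an illustration, not the paper's exact definition:

```python
import math

def nlul_loss(token_probs):
    """Assumed unlikelihood-style loss for tokens to be forgotten:
    average of -log(1 - p) over the forget-set token probabilities.
    Minimizing it pushes each p toward 0 (i.e., un-memorizes the token).
    """
    eps = 1e-8  # numerical floor so p ~ 1.0 does not produce log(0)
    return sum(-math.log(max(1.0 - p, eps)) for p in token_probs) / len(token_probs)

# A strongly memorized token (high p) incurs a large penalty;
# an already-forgotten token (low p) contributes almost nothing.
high = nlul_loss([0.9])   # large
low = nlul_loss([0.01])   # near zero
```

Note the contrast with plain gradient ascent on the log-likelihood `-log p`, whose signal shrinks as `p` falls; under this assumed form, the penalty instead concentrates on tokens that are still memorized.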