CATNIP: LLM Unlearning via Calibrated and Tokenized Negative Preference Alignment

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the security and privacy risks posed by memorized pretraining knowledge in large language models, which necessitate precise removal of undesirable knowledge while preserving general capabilities. To this end, the authors propose a negative preference alignment method based on token-level confidence calibration, which dynamically modulates the gradient update strength for tokens associated with undesirable knowledge, enabling fine-grained and efficient unlearning. Notably, the calibration mechanism operates without retained data or contrastive examples, making the approach robust to data scarcity and variable input lengths. Experiments on the MUSE and WMDP benchmarks show that the method significantly outperforms existing techniques, achieving a superior balance between effective forgetting and retention of useful knowledge.

📝 Abstract
Pretrained knowledge memorized in LLMs raises critical concerns over safety and privacy, which has motivated LLM Unlearning as a technique for selectively removing the influence of undesirable knowledge. Existing approaches, rooted in Gradient Ascent (GA), often degrade general domain knowledge while relying on retention data or curated contrastive pairs, which can be impractical or prohibitive in data and computation. Negative Preference Alignment has been explored to tackle the limitations of GA, but it remains confined by its choice of reference model and underperforms in realistic data settings. These limitations raise two key questions: i) Can we achieve effective unlearning that quantifies the model's confidence in undesirable knowledge and uses it to calibrate gradient updates more precisely, thus reducing catastrophic forgetting? ii) Can we make unlearning robust to data scarcity and length variation? We answer both questions affirmatively with CATNIP (Calibrated and Tokenized Negative Preference Alignment), a principled method that rescales unlearning effects in proportion to the model's token-level confidence, ensuring fine-grained control over forgetting. Extensive evaluations on the MUSE and WMDP benchmarks demonstrate that our method enables effective unlearning without requiring retention data or contrastive unlearning response pairs, with stronger tradeoffs between knowledge forgetting and preservation than state-of-the-art methods.
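The abstract describes rescaling the unlearning signal in proportion to the model's token-level confidence on forget-set tokens. A minimal PyTorch sketch of what such a confidence-weighted, NPO-style objective could look like is below; the function name, the confidence weighting scheme, and the `beta` value are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def calibrated_npo_loss(logits, ref_logits, target_ids, beta=0.1):
    """Sketch of a token-level confidence-calibrated negative-preference loss.

    logits, ref_logits: (batch, seq, vocab) scores from the model being
    unlearned and a frozen reference model; target_ids: (batch, seq)
    forget-set token ids. Hypothetical formulation, not the paper's.
    """
    # Per-token log-probabilities of the forget tokens under both models.
    logp = F.log_softmax(logits, dim=-1).gather(
        -1, target_ids.unsqueeze(-1)).squeeze(-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1).gather(
        -1, target_ids.unsqueeze(-1)).squeeze(-1)

    # Assumed calibration: weight each token by the model's current
    # confidence in it, so well-memorized tokens get stronger updates
    # while low-confidence tokens are barely touched.
    conf = logp.detach().exp()

    # NPO-style per-token term: pushes logp below ref_logp, with the
    # sigmoid saturating the gradient as the token is forgotten.
    per_token = -(2.0 / beta) * F.logsigmoid(-beta * (logp - ref_logp))

    # Confidence-weighted average over all forget tokens.
    return (conf * per_token).sum() / conf.sum().clamp_min(1e-8)
```

The saturating log-sigmoid is what distinguishes NPO-style objectives from plain gradient ascent: as a token's probability drops below the reference model's, its gradient shrinks instead of growing without bound, which is one reason such losses degrade general capabilities less.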
Problem

Research questions and friction points this paper is trying to address.

LLM Unlearning
Negative Preference Alignment
Catastrophic Forgetting
Data Scarcity
Knowledge Removal
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM Unlearning
Negative Preference Alignment
Token-level Calibration
Catastrophic Forgetting Mitigation
Data-efficient Unlearning