UCD: Unlearning in LLMs via Contrastive Decoding

📅 2025-06-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses machine unlearning for large language models (LLMs): efficiently removing specific sensitive or undesirable knowledge while preserving overall model utility. The proposed method, UCD (Unlearning via Contrastive Decoding), employs two lightweight auxiliary models, one trained with the forget set and one trained without it, and uses their output discrepancy during generation to steer the main model away from the target concepts. UCD combines auxiliary-model training with inference-time intervention, enabling concept-level forgetting without retraining the main model. Evaluated on the TOFU and MUSE benchmarks, UCD achieves strong results: it improves unlearning quality by up to 23.6% and keeps utility degradation under 0.8%. Unlike prior approaches that require fine-tuning or architectural modification, UCD operates entirely at inference time, is deployment-friendly, and leaves the original model intact, establishing a practical paradigm for concept-level unlearning in production-ready LLMs.

📝 Abstract
Machine unlearning aims to remove specific information, e.g. sensitive or undesirable content, from large language models (LLMs) while preserving overall performance. We propose an inference-time unlearning algorithm that uses contrastive decoding, leveraging two auxiliary smaller models, one trained without the forget set and one trained with it, to guide the outputs of the original model using their difference during inference. Our strategy substantially improves the tradeoff between unlearning effectiveness and model utility. We evaluate our approach on two unlearning benchmarks, TOFU and MUSE. Results show notable gains in both forget quality and retained performance in comparison to prior approaches, suggesting that incorporating contrastive decoding can offer an efficient, practical avenue for unlearning concepts in large-scale models.
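The abstract describes the contrastive decoding rule only at a high level. A minimal sketch of the per-step logit adjustment it implies is below; the exact combination rule and the scaling factor `alpha` are assumptions for illustration, not taken from the paper.

```python
def contrastive_decode_step(base_logits, with_forget_logits, without_forget_logits,
                            alpha=1.0):
    """Adjust the main model's next-token logits using the discrepancy between
    an auxiliary model trained WITH the forget set and one trained WITHOUT it.

    Tokens whose probability is inflated mainly by forget-set knowledge get a
    large positive (with - without) discrepancy, so they are pushed down.
    `alpha` is an assumed strength hyperparameter.
    """
    return [b - alpha * (wf - wo)
            for b, wf, wo in zip(base_logits, with_forget_logits,
                                 without_forget_logits)]
```

On tokens where the two auxiliary models agree, the discrepancy is zero and the main model's distribution is left untouched, which is one plausible reading of how the method preserves utility on retained content.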
Problem

Research questions and friction points this paper is trying to address.

Remove specific information from LLMs while preserving performance
Improve tradeoff between unlearning effectiveness and model utility
Evaluate on unlearning benchmarks (TOFU, MUSE), measuring both forget quality and retained performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inference-time unlearning via contrastive decoding
Uses two auxiliary models (trained with and without the forget set) to guide the original model
Improves tradeoff between unlearning and utility
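Operationally, the guidance happens inside the ordinary decoding loop: at each step the main model's logits are shifted by the auxiliary discrepancy before the next token is chosen. A minimal greedy-decoding sketch follows; the toy `*_logits_fn` callables, the combination rule, and `alpha` are illustrative assumptions, not the paper's implementation.

```python
def greedy_unlearn_decode(main_logits_fn, forget_logits_fn, retain_logits_fn,
                          prompt, max_new_tokens, alpha=1.0):
    """Greedy decoding with contrastive unlearning guidance (sketch).

    Each *_logits_fn maps a token-id sequence to next-step logits over the
    vocabulary. `forget_logits_fn` is the auxiliary trained WITH the forget
    set, `retain_logits_fn` the one trained WITHOUT it. `alpha` (assumed)
    controls suppression strength.
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        base = main_logits_fn(tokens)
        forget = forget_logits_fn(tokens)
        retain = retain_logits_fn(tokens)
        # Subtract the discrepancy so forget-set-specific tokens lose score.
        adjusted = [b - alpha * (f - r)
                    for b, f, r in zip(base, forget, retain)]
        # Greedy pick over the adjusted logits.
        tokens.append(max(range(len(adjusted)), key=adjusted.__getitem__))
    return tokens
```

Because only the decoding loop changes, the main model's weights are never touched, which is what makes the approach deployment-friendly.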