🤖 AI Summary
Large language models (LLMs) suffer from fragile machine unlearning: downstream fine-tuning can inadvertently reinstate previously deleted knowledge. Method: The paper proposes ILU, an invariant-learning-based unlearning framework grounded in invariant risk minimization (IRM) and, to the authors' knowledge, the first to bring invariance principles to LLM unlearning. ILU achieves cross-task robust forgetting, requiring only a single dataset for training while resisting heterogeneous fine-tuning (e.g., mathematical reasoning, sentiment analysis). Results: On the WMDP and MUSE benchmarks, ILU significantly outperforms NPO and RMU, achieving markedly stronger forgetting without compromising downstream task performance. Its core innovation is enforcing invariance of the unlearning objective under fine-tuning perturbations, establishing a generalizable and robust paradigm for selective knowledge deletion.
📝 Abstract
Machine unlearning offers a promising solution to privacy and safety concerns in large language models (LLMs) by selectively removing targeted knowledge while preserving utility. However, current methods are highly sensitive to downstream fine-tuning, which can quickly recover forgotten information, even from unrelated tasks. To address this, we introduce invariance into unlearning for the first time, inspired by invariant risk minimization (IRM). Building on this principle, we propose invariant LLM unlearning (ILU), a regularization-based framework that enhances robustness. Notably, ILU generalizes well to diverse fine-tuning tasks, even when trained on a single dataset. A task-vector analysis is also provided to further elucidate the rationale behind ILU's effectiveness. Extensive experiments on the WMDP and MUSE benchmarks reveal that ILU significantly outperforms state-of-the-art unlearning methods, including negative preference optimization (NPO) and representation misdirection for unlearning (RMU). Notably, ILU achieves superior unlearning robustness across diverse downstream fine-tuning scenarios (e.g., math, paraphrase detection, and sentiment analysis) while preserving fine-tuning performance.
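To make the IRM connection concrete, below is a minimal, purely illustrative sketch of what an IRM-style invariance penalty added to an unlearning risk could look like. This is not the paper's actual ILU objective; the environment structure, the squared-error "unlearning risk", and the function names (`irm_penalty`, `ilu_style_objective`, `lam`) are all assumptions for illustration. It uses the standard IRMv1 trick of penalizing the squared gradient of each environment's risk with respect to a dummy scale `s` at `s = 1`, computed analytically here for a squared-error risk.

```python
import numpy as np

def irm_penalty(pred, target):
    """IRMv1-style penalty for a squared-error risk (illustrative).

    risk(s) = mean((s * pred - target)^2); the penalty is the squared
    derivative d(risk)/ds evaluated at the dummy scale s = 1, which for
    this risk is mean(2 * (pred - target) * pred), computed in closed form.
    """
    grad = np.mean(2.0 * (pred - target) * pred)
    return grad ** 2

def ilu_style_objective(envs, lam=1.0):
    """Hypothetical combined objective in the spirit of ILU (assumption).

    `envs` is a list of (pred, target) pairs, one per "environment"
    (e.g., fine-tuning-like perturbations of the forget data). The
    objective averages the per-environment unlearning risk and adds the
    mean invariance penalty, weighted by `lam`.
    """
    risks = [np.mean((p - t) ** 2) for p, t in envs]
    penalties = [irm_penalty(p, t) for p, t in envs]
    return float(np.mean(risks) + lam * np.mean(penalties))

# Usage: two toy environments; the penalty vanishes when predictions
# already match targets, so only the mismatched environment contributes.
envs = [
    (np.array([1.0, 2.0]), np.array([1.0, 2.0])),  # risk 0, penalty 0
    (np.array([2.0, 0.0]), np.array([0.0, 0.0])),  # risk 2, penalty 16
]
print(ilu_style_objective(envs, lam=1.0))  # → 9.0 (mean risk 1 + mean penalty 8)
```

In a real LLM setting the risk would be a forgetting loss (e.g., NPO- or RMU-style) and the penalty gradient would be taken via autograd; the point of the sketch is only the structure: per-environment risks plus a penalty that forces the unlearning solution to be stationary across environments.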