Robust LLM Unlearning Against Relearning Attacks: The Minor Components in Representations Matter

📅 2026-05-12

🤖 AI Summary
This work addresses the vulnerability of large language models to relearning attacks after unlearning, wherein deleted knowledge can be rapidly recovered, posing significant security risks. From the perspective of representation geometry, this study is the first to reveal that minor components of representations play a crucial role in robustness against such attacks, yet existing unlearning methods predominantly perturb only the principal components, neglecting the minor ones. To remedy this, the authors propose Minor Component Unlearning (MCU), an algorithm that strategically perturbs the minor directions in the representation spectrum guided by optimization theory. Experimental results across three benchmark datasets demonstrate that MCU substantially outperforms state-of-the-art approaches, including sharpness-aware minimization, thereby significantly enhancing the robustness of machine unlearning.
πŸ“ Abstract
Large language model (LLM) unlearning aims to remove specific data influences from a pre-trained model without costly retraining, addressing privacy, copyright, and safety concerns. However, recent studies reveal a critical vulnerability: unlearned models rapidly recover "forgotten" knowledge through relearning attacks. This fragility raises serious security concerns, especially for open-weight models. In this work, we investigate the fundamental mechanism underlying this fragility from a representation geometry perspective. We discover that existing unlearning methods predominantly optimize along dominant components, leaving minor components largely unchanged. Critically, during relearning attacks, the modifications in these dominant components are easily reversed, enabling rapid knowledge recovery, whereas minor components exhibit stronger resistance to such reversal. We further provide a theoretical analysis that explains both observations from the spectral structure of representations. Building on this insight, we propose Minor Component Unlearning (MCU), a novel unlearning approach that explicitly targets minor components in representations. By concentrating unlearning effects in these inherently robust directions, our method achieves substantially improved resistance to relearning attacks. Extensive experiments on three datasets validate our approach, demonstrating significant improvements over state-of-the-art methods including sharpness-aware minimization.
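To make the distinction between dominant and minor components concrete, the following is a minimal sketch of how one might measure where a change in representations lands in the spectrum. It is not the MCU algorithm, which the abstract does not specify. The function name `split_update_energy`, the synthetic data, and the cutoff `k` are all illustrative assumptions; the sketch only takes an SVD of a representation matrix and reports what share of an update falls in the top-`k` principal subspace versus the remaining minor directions.

```python
import numpy as np


def split_update_energy(reps, delta, k):
    """Return (dominant_share, minor_share) of the squared norm of `delta`.

    reps  : (n_samples, dim) representation matrix (hypothetical input)
    delta : (m, dim) change in representations to analyse
    k     : number of leading principal directions treated as "dominant"
    """
    centered = reps - reps.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    dominant = vt[:k]
    # Orthogonal projection of delta onto the dominant subspace.
    projected = delta @ dominant.T @ dominant
    total = float(np.sum(delta ** 2))
    dominant_share = float(np.sum(projected ** 2)) / total
    # Everything outside the top-k subspace is counted as "minor".
    return dominant_share, 1.0 - dominant_share


# Synthetic, anisotropic representations: the first two axes carry most variance.
rng = np.random.default_rng(0)
scales = np.array([10.0, 5.0, 1.0, 0.5, 0.2, 0.1])
reps = rng.standard_normal((200, 6)) * scales

update_along_dominant = np.eye(6)[[0]]  # unit change along a high-variance axis
update_along_minor = np.eye(6)[[5]]     # unit change along a low-variance axis

print(split_update_energy(reps, update_along_dominant, k=2))
print(split_update_energy(reps, update_along_minor, k=2))
```

Under these assumptions, an update aligned with a high-variance axis should register almost entirely as dominant, and one aligned with a low-variance axis almost entirely as minor. A diagnostic of this kind is one way to check the abstract's claim that existing methods concentrate their changes in dominant components.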
Problem

Research questions and friction points this paper is trying to address.

LLM unlearning
relearning attacks
representation geometry
minor components
model security
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM unlearning
relearning attacks
representation geometry
minor components
MCU