Robust LLM Unlearning Against Relearning Attacks: The Minor Components in Representations Matter

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the vulnerability of unlearned large language models to relearning attacks, in which deleted knowledge can be rapidly recovered, posing significant security risks. Taking a representation-geometry perspective, the study reveals that minor components of representations play a crucial role in robustness against such attacks, yet existing unlearning methods predominantly perturb the dominant (principal) components and leave the minor ones largely unchanged. To remedy this, the authors propose Minor Component Unlearning (MCU), an approach that explicitly targets the minor directions of the representation spectrum and is motivated by a theoretical analysis of that spectrum. Experiments on three datasets show that MCU substantially outperforms state-of-the-art approaches, including sharpness-aware minimization, in resistance to relearning attacks.

📝 Abstract

Large language model (LLM) unlearning aims to remove specific data influences from a pre-trained model without costly retraining, addressing privacy, copyright, and safety concerns. However, recent studies reveal a critical vulnerability: unlearned models rapidly recover "forgotten" knowledge through relearning attacks. This fragility raises serious security concerns, especially for open-weight models. In this work, we investigate the fundamental mechanism underlying this fragility from a representation geometry perspective. We discover that existing unlearning methods predominantly optimize along dominant components, leaving minor components largely unchanged. Critically, during relearning attacks, the modifications in these dominant components are easily reversed, enabling rapid knowledge recovery, whereas minor components exhibit stronger resistance to such reversal. We further provide a theoretical analysis that explains both observations from the spectral structure of representations. Building on this insight, we propose Minor Component Unlearning (MCU), a novel unlearning approach that explicitly targets minor components in representations. By concentrating unlearning effects in these inherently robust directions, our method achieves substantially improved resistance to relearning attacks. Extensive experiments on three datasets validate our approach, demonstrating significant improvements over state-of-the-art methods, including sharpness-aware minimization.
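
The distinction between dominant and minor components can be made concrete with a small diagnostic. The sketch below is not the MCU algorithm, which this page does not specify. It only illustrates one plausible way to measure how much of a representation change falls along the top-k principal directions versus the remaining ones. The function name `split_update_energy`, the cutoff `k`, and the choice to take principal directions from the centered pre-unlearning representations are all assumptions made for illustration.

```python
import numpy as np


def split_update_energy(H_before, H_after, k):
    """Return (dominant_fraction, minor_fraction) of the squared change
    H_after - H_before, measured in the principal-direction basis of H_before.

    H_before, H_after: arrays of shape (n_samples, hidden_dim).
    k: number of leading principal directions treated as "dominant"
       (a hypothetical cutoff; the source does not state how it is chosen).
    """
    # Center the reference representations so that the right singular
    # vectors coincide with principal component directions.
    centered = H_before - H_before.mean(axis=0, keepdims=True)

    # full_matrices=True yields a complete orthonormal basis of the hidden
    # space, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(centered, full_matrices=True)

    # Express the change of every sample in that basis.
    delta = H_after - H_before
    coords = delta @ Vt.T

    # Squared change accumulated per direction.
    energy = (coords ** 2).sum(axis=0)
    total = energy.sum()
    if total == 0.0:
        return 0.0, 0.0
    return float(energy[:k].sum() / total), float(energy[k:].sum() / total)


if __name__ == "__main__":
    # Synthetic stand-in for hidden states with an anisotropic spectrum.
    rng = np.random.default_rng(0)
    scales = np.array([10.0, 5.0, 2.0, 0.5, 0.2, 0.1])
    H = rng.normal(size=(50, 6)) * scales

    # A random perturbation stands in for the effect of some unlearning step.
    H_new = H + 0.1 * rng.normal(size=H.shape)
    dominant, minor = split_update_energy(H, H_new, k=2)
    print(f"dominant fraction: {dominant:.3f}, minor fraction: {minor:.3f}")
```

In practice the two matrices would hold hidden states collected from the same layer before and after unlearning (or before and after a relearning attack) on the same inputs. Under the abstract's claim, one would expect existing methods to show a large dominant fraction, though that expectation comes from the paper and is not something this synthetic example demonstrates.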

Problem

Research questions and friction points this paper is trying to address.

LLM unlearning

relearning attacks

representation geometry

minor components

model security

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM unlearning

relearning attacks

representation geometry