Collapse of Irrelevant Representations (CIR) Ensures Robust and Non-Disruptive LLM Unlearning

📅 2025-09-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing language-model unlearning techniques struggle to simultaneously eliminate hazardous knowledge and preserve general capabilities. To address this, the paper proposes Collapse of Irrelevant Representations (CIR), a method that applies PCA to module activations and output gradients to identify the subspaces carrying common, general-purpose representations, and collapses those subspaces before computing unlearning updates, so that localized parameter changes affect only representations specific to the target facts. This makes unlearning highly selective: robust removal of target knowledge with minimal interference with non-target knowledge. Experiments unlearning WMDP facts from Llama-3.1-8B show that CIR reduces post-attack accuracy 80× more than the best baseline (Circuit Breakers) on biohazardous facts and 30× more on cyberhazardous facts, while increasing WikiText loss by only 0.1% and requiring under 3 GPU-seconds per fact on average.

📝 Abstract
Current unlearning techniques and safety training consistently fail to remove dangerous knowledge from language models. We analyze the root causes and propose a highly selective technique which unlearns robustly and without disrupting general performance. We perform PCA on activations and module output gradients to identify subspaces containing common representations, and collapse them before calculating unlearning updates. This way we avoid unlearning general representations, and only target those specific to the unlearned facts. When unlearning WMDP dataset facts from Llama-3.1-8B, we drop post-attack accuracy 80x more than our best baseline (Circuit Breakers) on biohazardous facts and 30x more on cyberhazardous facts. Despite this, we disrupt general performance 30x less (only 0.1% WikiText loss increase), while requiring less than 3 GPU-seconds per fact.
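The core mechanism described in the abstract, running PCA on module activations to find the shared (general-purpose) subspace, then collapsing that subspace out of the unlearning gradients, can be sketched as below. This is an illustrative reconstruction, not the authors' code; the function name, the fixed rank `k`, and the use of plain SVD-based PCA are all assumptions:

```python
import numpy as np

def collapse_common_subspace(activations, grads, k=4):
    """Illustrative sketch of CIR's gradient collapse (names hypothetical).

    activations: (n_samples, d) module outputs collected over general text
    grads:       (n_samples, d) module-output gradients for the target facts
    k:           number of principal "common representation" directions

    Returns grads with their component inside the common activation
    subspace projected out, so updates avoid general representations.
    """
    # PCA via SVD on mean-centered activations
    A = activations - activations.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    common = Vt[:k]                    # (k, d) top principal directions
    # Collapse: remove the common-subspace component from each gradient
    proj = grads @ common.T @ common   # component lying in the common subspace
    return grads - proj                # residual, fact-specific gradient
```

After this step, the returned gradients are orthogonal to the top-k principal directions of the activation data, which is the sense in which the update "avoids unlearning general representations."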
Problem

Research questions and friction points this paper is trying to address.

How to remove dangerous knowledge from a model without disrupting its general performance
How to separate fact-specific representations from general representations shared across inputs
How to make unlearning robust to attacks at low computational cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Collapses irrelevant representations via PCA
Targets specific facts without disrupting performance
Uses activation and gradient subspace analysis
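The bullets above describe collapsing common representations before applying a localized parameter update. For a single linear module, one such update step might look like the following sketch (all names are hypothetical, and the paper's actual update rule may differ; `common_dirs` is assumed to hold orthonormal principal directions of the module's output activations):

```python
import numpy as np

def unlearning_step(W, x, grad_out, common_dirs, lr=1e-2):
    """Hypothetical single unlearning update for a linear module y = W @ x.

    W:           (d_out, d_in) module weights
    x:           (d_in,) input activation for the target fact
    grad_out:    (d_out,) loss gradient w.r.t. the module output
    common_dirs: (k, d_out) orthonormal common-representation directions

    The output gradient is first collapsed onto the subspace orthogonal
    to the common directions, then used in a standard gradient step.
    """
    g = grad_out - common_dirs.T @ (common_dirs @ grad_out)  # collapse
    # For y = W @ x, dL/dW = outer(dL/dy, x)
    return W - lr * np.outer(g, x)
```

A useful property of this construction: if the fact's gradient lies entirely inside the common subspace, the collapsed gradient is zero and the weights are left untouched, which is how general knowledge is protected.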