LLM Unlearning Should Be Form-Independent

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM unlearning methods rely heavily on the specific surface forms of training samples, exhibiting poor generalization and failing to handle the diverse linguistic expressions of knowledge in real-world scenarios. This paper formally defines this "Form-Dependent Bias" and proposes Rank-one Concept Redirection (ROCR), a training-free unlearning method that runs in seconds. ROCR identifies hazardous concepts via concept activation analysis and redirects their representations through a rank-one parameter remapping, enabling form-agnostic unlearning. It requires no retraining, auxiliary data, or architectural modification, and is compatible with mainstream LLMs. Evaluated on the newly constructed ORT benchmark, ROCR significantly outperforms prior approaches across three key dimensions: unlearning effectiveness, output fluency, and robustness to lexical and syntactic variations in how the target concept is expressed.
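The rank-one parameter remapping described above can be sketched as a closed-form edit to a single weight matrix: after the update, the layer maps the target concept's key direction to the harmless concept's value vector, while directions orthogonal to the key are left unchanged. This is a minimal illustrative sketch only; the function name, the choice of layer, and the exact update rule are assumptions, not the paper's published ROCR formulation:

```python
import numpy as np

def rank_one_redirect(W, k_target, v_harmless):
    """Illustrative rank-one remap (not the paper's exact formulation).

    After the update, the layer maps the unit key direction of the
    target concept to v_harmless; any input orthogonal to k_target
    is mapped exactly as before.
    """
    k = k_target / np.linalg.norm(k_target)  # unit key direction
    residual = v_harmless - W @ k            # change needed along k
    return W + np.outer(residual, k)         # rank-one update

# Toy usage: a 4x4 "layer", redirect concept key k to harmless value v.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
k = rng.standard_normal(4)
v = rng.standard_normal(4)
W_new = rank_one_redirect(W, k, v)
```

Because the update is rank-one, it touches only the key's direction in input space, which is one plausible reason such edits can be applied in seconds without retraining.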

📝 Abstract
Large Language Model (LLM) unlearning aims to erase or suppress undesirable knowledge within the model, offering promise for controlling harmful or private information to prevent misuse. However, recent studies highlight its limited efficacy in real-world scenarios, hindering practical adoption. In this study, we identify a pervasive issue underlying many downstream failures: the effectiveness of existing unlearning methods heavily depends on the form of training samples and frequently fails to generalize to alternate expressions of the same knowledge. We formally characterize this problem as Form-Dependent Bias and systematically investigate its specific manifestation patterns across various downstream tasks. To quantify its prevalence and support future research, we introduce ORT, a novel benchmark designed to evaluate the robustness of unlearning methods against variations in knowledge expression. Results reveal that Form-Dependent Bias is both widespread and severe among current techniques. We argue that LLM unlearning should be form-independent to address the endless forms of downstream tasks encountered in real-world security-critical scenarios. Towards this goal, we introduce Rank-one Concept Redirection (ROCR), a novel training-free method, as a promising solution path. ROCR performs unlearning by targeting the invariants in downstream tasks, specifically the activated dangerous concepts. It is capable of modifying model parameters within seconds to redirect the model's perception of a specific unlearning target concept to another harmless concept. Extensive experiments demonstrate that ROCR significantly improves unlearning effectiveness compared to traditional methods while generating highly natural outputs.
Problem

Research questions and friction points this paper is trying to address.

LLM unlearning effectiveness depends on training sample form
Form-Dependent Bias limits generalization to alternate knowledge expressions
Current unlearning methods struggle with real-world security-critical scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Rank-one Concept Redirection (ROCR)
Targets invariants in downstream tasks
Modifies model parameters in seconds
Xiaotian Ye
Beijing University of Posts and Telecommunications
Natural Language Processing · Knowledge Representation · Large Language Models
Mengqi Zhang
Shandong University
Shu Wu
New Laboratory of Pattern Recognition (NLPR), State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences