Beyond Forgetting: Machine Unlearning Elicits Controllable Side Behaviors and Capabilities

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether machine unlearning can simultaneously achieve targeted forgetting and induce controllable behavioral changes alongside capability enhancement. Building upon the linear representation hypothesis, the authors propose a “representation misdirection” method that linearly manipulates a one-dimensional high-level concept vector within the latent representations of to-be-forgotten samples. This approach enables precise modulation of model behaviors—such as truthfulness, sentiment orientation, and refusal tendencies—while also improving in-context learning performance. Extensive experiments across diverse tasks demonstrate the method’s effectiveness, not only enabling fine-grained control over model behavior but also significantly enhancing core capabilities. These findings reveal the dual nature of machine unlearning, highlighting both its potential risks and promising applications.

📝 Abstract
We consider representation misdirection (RM), a class of LLM unlearning methods that achieves forgetting by manipulating the forget-representations, that is, the latent representations of forget samples. Despite their importance, the roles of the target vectors used in RM remain underexplored. Here, we revisit RM through the lens of the linear representation hypothesis. Specifically, if one can identify a one-dimensional representation corresponding to a high-level concept, the linear representation hypothesis enables linear operations on this concept vector within the forget-representation space. Under this view, we hypothesize that, beyond forgetting, machine unlearning elicits controllable side behaviors and stronger side capabilities corresponding to the high-level concept. Our hypothesis is empirically validated across a wide range of tasks, including behavioral control (e.g., controlling unlearned models' truth, sentiment, and refusal) and capability enhancement (e.g., improving unlearned models' in-context learning capability). Our findings reveal that this phenomenon could be either a hidden risk if misused or a mechanism that can be harnessed for developing models that require stronger capabilities and controllable behaviors.
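The linear operation the abstract describes can be illustrated with a minimal sketch: shift a hidden state along a unit-norm concept direction by a signed strength. This is an assumption-laden toy, not the authors' exact method; the function name `steer`, the 4-dimensional state, and the "refusal" direction are hypothetical, chosen only to show what a linear edit in representation space looks like.

```python
import numpy as np

def steer(h, v, alpha):
    """Shift a latent representation h along a concept direction v.

    Illustrative sketch of a linear edit under the linear representation
    hypothesis (not the paper's exact RM procedure).
    h: (d,) hidden state; v: (d,) concept direction; alpha: signed strength.
    """
    v = v / np.linalg.norm(v)  # normalize so alpha is in representation units
    return h + alpha * v       # linear operation in representation space

# toy example: a 4-d hidden state and a hypothetical "refusal" direction
h = np.array([0.5, -1.0, 2.0, 0.0])
v = np.array([1.0, 0.0, 0.0, 0.0])
h_steered = steer(h, v, alpha=3.0)  # h moved 3 units along the direction
```

A positive `alpha` amplifies the concept in the representation and a negative one suppresses it, which is the sense in which such an edit can both forget and control a side behavior.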
Problem

Research questions and friction points this paper is trying to address.

machine unlearning
representation misdirection
linear representation hypothesis
side behaviors
side capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

machine unlearning
representation misdirection
linear representation hypothesis
controllable behaviors
capability enhancement
Dang Huu-Tien
Japan Advanced Institute of Science and Technology
The-Hai Nguyen
Japan Advanced Institute of Science and Technology
Dinh Mai Phuong
Japan Advanced Institute of Science and Technology
Nguyen Minh Phuong
Japan Advanced Institute of Science and Technology
NLP · Semantic Parsing · QA · ERC · Machine Translation
Hoang Thanh-Tung
VNU University of Engineering and Technology, Vietnam
Le-Minh Nguyen
Japan Advanced Institute of Science and Technology
Naoya Inoue
Japan Advanced Institute of Science and Technology (JAIST) / RIKEN AIP
Interpretability/Explainable AI · Commonsense Reasoning · Reading Comprehension · Argumentation