Beyond Forgetting: Machine Unlearning Elicits Controllable Side Behaviors and Capabilities

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether machine unlearning can simultaneously achieve targeted forgetting and induce controllable behavioral changes alongside capability enhancement. Building upon the linear representation hypothesis, the authors propose a “representation misdirection” method that linearly manipulates a one-dimensional high-level concept vector within the latent representations of to-be-forgotten samples. This approach enables precise modulation of model behaviors—such as truthfulness, sentiment orientation, and refusal tendencies—while also improving in-context learning performance. Extensive experiments across diverse tasks demonstrate the method’s effectiveness, not only enabling fine-grained control over model behavior but also significantly enhancing core capabilities. These findings reveal the dual nature of machine unlearning, highlighting both its potential risks and promising applications.

📝 Abstract
We consider representation misdirection (RM), a class of LLM unlearning methods that achieves forgetting by manipulating the forget-representations, that is, the latent representations of forget samples. Despite their importance, the roles of the target vectors used in RM remain underexplored. Here, we revisit RM through the lens of the linear representation hypothesis. Specifically, if one can identify a one-dimensional representation corresponding to a high-level concept, the linear representation hypothesis enables linear operations on this concept vector within the forget-representation space. Under this view, we hypothesize that, beyond forgetting, machine unlearning elicits controllable side behaviors and stronger side capabilities corresponding to the high-level concept. Our hypothesis is empirically validated across a wide range of tasks, including behavioral control (e.g., controlling unlearned models' truth, sentiment, and refusal) and capability enhancement (e.g., improving unlearned models' in-context learning capability). Our findings reveal that this phenomenon could be either a hidden risk if misused or a mechanism that can be harnessed for developing models that require stronger capabilities and controllable behaviors.
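The linear operation the abstract describes can be illustrated with a minimal sketch: shift a hidden state along a unit-norm concept direction by a signed strength. This is an assumption-laden toy, not the authors' exact method; the function name `steer`, the 4-dimensional state, and the "refusal" direction are hypothetical, chosen only to show what a linear edit in representation space looks like.

```python
import numpy as np

def steer(h, v, alpha):
    """Shift a latent representation h along a concept direction v.

    Illustrative sketch of a linear edit under the linear representation
    hypothesis (not the paper's exact RM procedure).
    h: (d,) hidden state; v: (d,) concept direction; alpha: signed strength.
    """
    v = v / np.linalg.norm(v)  # normalize so alpha is in representation units
    return h + alpha * v       # linear operation in representation space

# toy example: a 4-d hidden state and a hypothetical "refusal" direction
h = np.array([0.5, -1.0, 2.0, 0.0])
v = np.array([1.0, 0.0, 0.0, 0.0])
h_steered = steer(h, v, alpha=3.0)  # h moved 3 units along the direction
```

A positive `alpha` amplifies the concept in the representation and a negative one suppresses it, which is the sense in which such an edit can both forget and control a side behavior.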
Problem

Research questions and friction points this paper is trying to address.

machine unlearning
representation misdirection
linear representation hypothesis
side behaviors
side capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

machine unlearning
representation misdirection
linear representation hypothesis
controllable behaviors
capability enhancement
Dang Huu-Tien
Japan Advanced Institute of Science and Technology
The-Hai Nguyen
Japan Advanced Institute of Science and Technology
Dinh Mai Phuong
Japan Advanced Institute of Science and Technology
Nguyen Minh Phuong
Japan Advanced Institute of Science and Technology
NLP · Semantic Parsing · QA · ERC · Machine Translation
Hoang Thanh-Tung
VNU University of Engineering and Technology, Vietnam
Le-Minh Nguyen
Japan Advanced Institute of Science and Technology
Naoya Inoue
Japan Advanced Institute of Science and Technology (JAIST) / RIKEN AIP
Interpretability/Explainable AI · Commonsense Reasoning · Reading Comprehension · Argumentation