Inference-Time Machine Unlearning via Gated Activation Redirection

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the need for efficient, reversible unlearning that avoids retraining or editing the weights of large language models, whose memorization of training data creates privacy, copyright, and safety risks. The authors propose GUARD-IT, a training- and gradient-free method that operates at inference time: the steering is input-dependent (gated), and the intervention is applied as a norm-preserving rotation in the residual stream, so model weights stay untouched. Because the intervention depends on the input, it avoids the unintended behavior changes that arise when a single global steering vector is applied to every input. On the TOFU and MUSE benchmarks across three model scales, the method matches or exceeds 12 gradient-based baselines and is reported as the only method that simultaneously preserves utility, suppresses memorization, and avoids catastrophic collapse in all settings. It also supports continual unlearning without retraining and remains effective under quantization, where parameter-editing methods degrade.

📝 Abstract

Large Language Models memorize vast amounts of training data, raising concerns regarding privacy, copyright infringement, and safety. Machine unlearning seeks to remove the influence of a targeted forget set while preserving model performance, ideally approximating a model retrained from scratch without the forget set. Existing approaches aim to achieve this by updating model parameters via gradient-based methods. However, these updates are computationally expensive, lead to irreversible weight changes, and degrade when the model is quantized for deployment. A recent alternative to changing model weights is activation engineering, where activations are changed during inference to steer model behavior. Despite circumventing weight editing, naive activation steering introduces its own failure modes, as a single global steering vector applies the same intervention to every input, leading to unintended changes in model behavior. We introduce Inference-Time Unlearning via Gated Activation Redirection (GUARD-IT), a training- and gradient-free method that unlearns via input-dependent activation steering at inference time. The resulting intervention is applied as a norm-preserving rotation in the residual stream, leaving model weights untouched. Experiments on TOFU and MUSE show that GUARD-IT matches or exceeds 12 gradient-based baselines across three model scales, while being the only method to simultaneously preserve utility, suppress memorization, and avoid catastrophic collapse across all settings. GUARD-IT further supports continual unlearning without retraining, and remains effective under quantization, a scenario in which parameter-editing methods degrade.
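To make the idea of a gated, norm-preserving rotation concrete, here is a minimal sketch in NumPy. It is not the authors' implementation: the abstract does not say how the gate is computed or how the rotation is chosen. The sketch assumes a cosine-similarity threshold as the gate and a fixed-angle rotation in the plane spanned by a "forget" direction and a "target" direction. The names `rotate_in_plane`, `gated_redirect`, `forget_dir`, `target_dir`, `threshold`, and `theta` are all hypothetical.

```python
import numpy as np

def rotate_in_plane(h, u, v, theta):
    """Rotate h by theta inside the plane spanned by orthonormal u and v."""
    a, b = h @ u, h @ v                 # coordinates of h inside the plane
    h_rest = h - a * u - b * v          # component outside the plane, left untouched
    a_rot = a * np.cos(theta) - b * np.sin(theta)
    b_rot = a * np.sin(theta) + b * np.cos(theta)
    return h_rest + a_rot * u + b_rot * v

def gated_redirect(h, forget_dir, target_dir, threshold=0.3, theta=np.pi / 2):
    """Assumed gate: intervene only if h aligns with the forget direction."""
    u = forget_dir / np.linalg.norm(forget_dir)
    # Gram-Schmidt step so that u and v form an orthonormal pair.
    w = target_dir - (target_dir @ u) * u
    v = w / np.linalg.norm(w)
    cos_sim = (h @ u) / (np.linalg.norm(h) + 1e-12)
    if cos_sim < threshold:
        return h.copy()                 # gate closed: activation passes through
    return rotate_in_plane(h, u, v, theta)

# Toy 8-dimensional "residual stream" vectors (illustrative values only).
basis = np.eye(8)
forget_dir = basis[0]
target_dir = basis[1] + 0.5 * basis[0]
h_forget = 3.0 * basis[0] + 0.1 * basis[2]   # strongly aligned with forget_dir
h_other = basis[3]                            # unrelated input

out_forget = gated_redirect(h_forget, forget_dir, target_dir)
out_other = gated_redirect(h_other, forget_dir, target_dir)
```

In this toy setup the aligned activation is rotated away from the forget direction while its norm is unchanged, and the unrelated activation is returned as-is. That contrasts with adding one global steering vector, which would shift both inputs and change their norms. In a real model such a function would be attached to a layer's residual-stream output (for example through a forward hook), with no weight updates involved.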

Problem

Research questions and friction points this paper is trying to address.

machine unlearning

large language models

privacy

activation steering

inference-time intervention

Innovation

Methods, ideas, or system contributions that make the work stand out.

inference-time unlearning

activation steering

gated redirection