Inference-Time Machine Unlearning via Gated Activation Redirection

📅 2026-05-12
🤖 AI Summary
This work addresses the need for efficient, reversible unlearning methods that avoid retraining large language models, which memorize training data and thereby pose privacy, copyright, and safety risks. The authors propose GUARD-IT, a training- and gradient-free inference-time unlearning mechanism that gates its intervention on the input and applies it as a norm-preserving rotation in the residual stream, suppressing the influence of a specified forget set without altering model weights. Because the steering is input-dependent rather than a single global vector, it avoids the unintended behavior changes of global interventions, and it supports continual unlearning and remains effective in quantized models. Evaluated on the TOFU and MUSE benchmarks across three model scales, the method matches or exceeds twelve gradient-based baselines, suppressing memorization while preserving model utility and avoiding catastrophic collapse.
📝 Abstract
Large Language Models memorize vast amounts of training data, raising concerns regarding privacy, copyright infringement, and safety. Machine unlearning seeks to remove the influence of a targeted forget set while preserving model performance, ideally approximating a model retrained from scratch without the forget set. Existing approaches aim to achieve this by updating model parameters via gradient-based methods. However, these updates are computationally expensive, lead to irreversible weight changes, and degrade when the model is quantized for deployment. A recent alternative to changing model weights is activation engineering, where activations are changed during inference to steer model behavior. Despite circumventing weight editing, naive activation steering introduces its own failure modes, as a single global steering vector applies the same intervention to every input, leading to unintended changes in model behavior. We introduce Inference-Time Unlearning via Gated Activation Redirection (GUARD-IT), a training- and gradient-free method that unlearns via input-dependent activation steering at inference time. The resulting intervention is applied as a norm-preserving rotation in the residual stream, leaving model weights untouched. Experiments on TOFU and MUSE show that GUARD-IT matches or exceeds 12 gradient-based baselines across three model scales, while being the only method to simultaneously preserve utility, suppress memorization, and avoid catastrophic collapse across all settings. GUARD-IT further supports continual unlearning without retraining, and remains effective under quantization, a scenario in which parameter-editing methods degrade.
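The abstract describes the intervention only at a high level: an input-dependent gate decides how strongly to steer, and the steering itself is a norm-preserving rotation of the residual-stream activation. The following is a minimal sketch of that general idea, not the authors' implementation. The function name `gated_rotation`, the cosine-similarity gate, the sigmoid with its `threshold` and `sharpness` parameters, and the choice of a single forget direction and target direction are all assumptions made for illustration; the abstract does not specify any of them.

```python
import numpy as np

def gated_rotation(h, forget_dir, target_dir, threshold=0.3, sharpness=20.0):
    """Rotate h from forget_dir toward target_dir, scaled by an input-dependent gate.

    Hypothetical sketch: the gate and the directions are illustrative assumptions.
    Assumes forget_dir and target_dir are not parallel.
    """
    # Orthonormal basis (u, w) for the plane spanned by the two directions.
    u = forget_dir / np.linalg.norm(forget_dir)
    w = target_dir - (target_dir @ u) * u          # Gram-Schmidt step
    w = w / np.linalg.norm(w)

    # Gate in (0, 1): close to 1 only when h aligns with the forget direction.
    cos_sim = (h @ u) / (np.linalg.norm(h) + 1e-12)
    gate = 1.0 / (1.0 + np.exp(-sharpness * (cos_sim - threshold)))

    # Full rotation angle carries u onto the (normalized) target direction.
    t = target_dir / np.linalg.norm(target_dir)
    angle = gate * np.arccos(np.clip(u @ t, -1.0, 1.0))

    # Rotate only the in-plane coordinates; the orthogonal part of h is untouched,
    # so the overall norm of h is preserved.
    a, b = h @ u, h @ w
    a_rot = np.cos(angle) * a - np.sin(angle) * b
    b_rot = np.sin(angle) * a + np.cos(angle) * b
    return h + (a_rot - a) * u + (b_rot - b) * w, gate

# Toy 8-dimensional example with hand-picked directions.
e = np.eye(8)
h_forget = 2.0 * e[0] + 0.1 * e[2]   # activation aligned with the forget direction
h_other = e[3]                        # unrelated activation
out_forget, g_forget = gated_rotation(h_forget, e[0], e[1])
out_other, g_other = gated_rotation(h_other, e[0], e[1])
```

In this toy setup the aligned activation is rotated almost fully onto the target direction while keeping its norm, and the unrelated activation passes through essentially unchanged. That is the contrast the abstract draws with a single global steering vector, which would shift both inputs by the same amount.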
Problem

Research questions and friction points this paper is trying to address.

machine unlearning
large language models
privacy
activation steering
inference-time intervention
Innovation

Methods, ideas, or system contributions that make the work stand out.

inference-time unlearning
activation steering
gated redirection
weight-free editing
quantization robustness
Vinícius Conte Turani
MALTA, Machine Learning Theory and Applications Lab, PUCRS, Porto Alegre, Brazil
Otávio Parraga
MALTA, Machine Learning Theory and Applications Lab, PUCRS, Porto Alegre, Brazil
João Vitor Boer Abitante
MALTA, Machine Learning Theory and Applications Lab, PUCRS, Porto Alegre, Brazil
Kristen K. Arguello
MALTA, Machine Learning Theory and Applications Lab, PUCRS, Porto Alegre, Brazil
Joana Pasquali
MALTA, Machine Learning Theory and Applications Lab, PUCRS, Porto Alegre, Brazil
Ramiro N. Barros
MALTA, Machine Learning Theory and Applications Lab, PUCRS, Porto Alegre, Brazil
Flavio du Pin Calmon
Harvard University
Information Theory, Statistical Machine Learning, Fair Machine Learning
Christian Mattjie
MALTA, Machine Learning Theory and Applications Lab, PUCRS, Porto Alegre, Brazil
Rodrigo C. Barros
MALTA, Machine Learning Theory and Applications Lab, PUCRS, Porto Alegre, Brazil; Kunumi Institute, Brazil
Lucas S. Kupssinskü
MALTA, Machine Learning Theory and Applications Lab, PUCRS, Porto Alegre, Brazil