🤖 AI Summary
In machine unlearning, forgetting updates often degrade retained knowledge; existing methods struggle to simultaneously ensure effective forgetting and stability on the retained set, and lack a theoretical characterization of these side effects. Method: This paper formally establishes, for the first time, a geometric decoupling condition: the retain loss is invariant to first order if and only if the update direction is orthogonal to the subspace spanned by gradients on the retained set. Building on this, we propose a geometric-disentanglement unlearning framework that orthogonally decomposes the forgetting gradient into a tangential (harmful) component inside the retain-gradient subspace and a normal (safe) component outside it, applies updates only along the normal direction, and jointly optimizes the forgetting and retention objectives under a trust-region constraint. Results: Evaluated on the TOFU, MUSE, and WMDP benchmarks, our framework consistently improves multiple state-of-the-art unlearning methods, strengthening forgetting efficacy while preserving the retained set, with provable first-order guarantees.
📝 Abstract
Machine unlearning, the removal of a training subset's influence from a deployed model, is critical for privacy preservation and model reliability, yet gradient ascent on forget samples often harms retained knowledge. Existing approaches face a persistent tradeoff between effective forgetting and preservation of the retain set. While previous methods provide useful heuristics, they often lack a formal analysis of how exactly forgetting updates harm retained knowledge, and of whether the side effects can be removed with theoretical guarantees. To explore a theoretically sound and simple solution, we start from first principles on how performance on the retain set is actually affected: a first-order analysis of the local change in the retain loss under small parameter updates. This yields a crisp equivalence: the retain loss is unchanged to first order iff the update direction is orthogonal to the subspace spanned by retain gradients ("retain-invariant"). This identifies the entangled component as the tangential part of the forget update within the retain-gradient subspace, and characterizes disentanglement as orthogonality. Guided by this, we propose Geometric-disentanglement Unlearning (GU), which decomposes any candidate forget-gradient update into components tangential and normal to the retain-gradient subspace and executes only the normal component. Under a standard trust-region budget, the projected direction aligned with the raw forget gradient is optimal among all first-order retain-invariant moves, and we also derive the optimal projected direction for joint forget-retain objectives. Our method is plug-and-play and can be attached to existing gradient-based unlearning procedures to mitigate side effects. GU achieves consistent improvements over various methods across three benchmarks: TOFU, MUSE, and WMDP.
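The core projection step described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under simplifying assumptions (dense gradient vectors, a small set of retain gradients stacked explicitly); the function name and interface are illustrative, not taken from the paper, and the trust-region scaling is omitted.

```python
import numpy as np

def project_out_retain(g_forget, retain_grads):
    """Split a forget gradient into tangential and normal components
    relative to the subspace spanned by the retain gradients.

    g_forget: (d,) candidate forget-update direction.
    retain_grads: list of (d,) gradients of the retain loss.
    Returns (normal, tangential), where `normal` is the first-order
    retain-invariant part that is actually applied.
    """
    # Stack retain gradients as columns of a d x k matrix.
    G = np.stack(retain_grads, axis=1)
    # Orthonormal basis Q for the retain-gradient subspace via QR.
    Q, _ = np.linalg.qr(G)
    # Tangential (harmful) component: orthogonal projection onto span(Q).
    tangential = Q @ (Q.T @ g_forget)
    # Normal (safe) component: orthogonal to every retain gradient,
    # so the retain loss is unchanged to first order along it.
    normal = g_forget - tangential
    return normal, tangential
```

Applying only `normal` (optionally rescaled to a trust-region radius) realizes the retain-invariant move: its inner product with every retain gradient is zero, so a first-order Taylor expansion of the retain loss is unaffected by the update.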