Mirror Mirror on the Wall, Have I Forgotten it All? A New Framework for Evaluating Machine Unlearning

📅 2025-05-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the problem of rigorously assessing the trustworthiness of machine unlearning. It proposes the first computationally secure formal definition—*computational unlearning*—which defines successful unlearning as the inability of any polynomial-time adversary to distinguish the unlearned model from a mirror model retrained from scratch on the dataset with the forget set removed. Methodologically, the authors design efficient distinguishing adversaries based on membership inference scores and Kullback-Leibler (KL) divergence, and draw on information-theoretic and cryptographic security paradigms for the theoretical analysis. They prove that deterministic computational unlearning is impossible for entropic learning algorithms, and show that differential-privacy-based unlearning can satisfy the formal guarantee only at the cost of substantial utility loss. Empirically, they demonstrate that representative unlearning methods from the literature fail to satisfy computational unlearning. The work thus establishes a theoretical feasibility framework for unlearning and exposes a fundamental gap between current methods and rigorous forgetting guarantees.
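The indistinguishability requirement described above can be sketched in standard cryptographic notation (the symbols below are our own shorthand, not necessarily the paper's): an unlearning method \(U\) achieves computational unlearning if every polynomial-time adversary \(\mathcal{A}\) has only negligible advantage in telling the unlearned model apart from the mirror model.

```latex
\[
\mathrm{Adv}(\mathcal{A})
= \Bigl|\,
  \Pr\bigl[\mathcal{A}\bigl(U(M, D_f)\bigr) = 1\bigr]
  - \Pr\bigl[\mathcal{A}\bigl(\mathrm{Train}(D \setminus D_f)\bigr) = 1\bigr]
\,\Bigr|
\le \mathrm{negl}(\lambda)
\]
```

Here \(M\) is the model trained on the full dataset \(D\), \(D_f\) is the forget set, \(\mathrm{Train}(D \setminus D_f)\) is the mirror model retrained without the forget set, and \(\mathrm{negl}(\lambda)\) is a function negligible in the security parameter \(\lambda\).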

📝 Abstract
Machine unlearning methods take a model trained on a dataset and a forget set, then attempt to produce a model as if it had only been trained on the examples not in the forget set. We empirically show that an adversary is able to distinguish between a mirror model (a control model produced by retraining without the data to forget) and a model produced by an unlearning method across representative unlearning methods from the literature. We build distinguishing algorithms based on evaluation scores in the literature (i.e. membership inference scores) and Kullback-Leibler divergence. We propose a strong formal definition for machine unlearning called computational unlearning. Computational unlearning is defined as the inability of an adversary to distinguish between a mirror model and a model produced by an unlearning method. If the adversary cannot guess better than random (except with negligible probability), then we say that an unlearning method achieves computational unlearning. Our computational unlearning definition provides theoretical structure to prove unlearning feasibility results. For example, our computational unlearning definition immediately implies that there are no deterministic computational unlearning methods for entropic learning algorithms. We also explore the relationship between differential privacy (DP)-based unlearning methods and computational unlearning, showing that DP-based approaches can satisfy computational unlearning at the cost of an extreme utility collapse. These results demonstrate that current methodology in the literature fundamentally falls short of achieving computational unlearning. We conclude by identifying several open questions for future work.
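A KL-divergence-based distinguisher of the kind the abstract describes can be sketched as follows. This is a minimal illustration, assuming the adversary can query both models' predictive distributions on the forget-set examples; the function names and the simple threshold rule are our own, not the paper's.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete probability distributions,
    given as equal-length sequences of probabilities."""
    return sum(pi * math.log(max(pi, eps) / max(qi, eps))
               for pi, qi in zip(p, q))


def distinguish(candidate_outputs, mirror_outputs, threshold):
    """Toy distinguisher: compare per-example predictive distributions
    from a candidate (possibly unlearned) model against those of the
    mirror model. If the average KL divergence exceeds the threshold,
    guess that the candidate came from an unlearning method rather
    than from retraining."""
    divs = [kl_divergence(p, q)
            for p, q in zip(candidate_outputs, mirror_outputs)]
    return "unlearned" if sum(divs) / len(divs) > threshold else "mirror"
```

If such a distinguisher succeeds with non-negligible advantage, the unlearning method fails the computational unlearning definition; a method satisfying the definition would force the adversary's guess back to chance.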
Problem

Research questions and friction points this paper is trying to address.

How to rigorously evaluate the effectiveness of machine unlearning methods
Whether a strong, cryptographic-style formal definition of unlearning can be stated and satisfied
Why current unlearning approaches fall short of strong forgetting guarantees
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes computational unlearning as strong formal definition
Uses Kullback-Leibler divergence for model comparison
Links differential privacy to computational unlearning feasibility