🤖 AI Summary
Existing machine unlearning methods lack a credible evaluation of “complete forgetting”: they often fail to eliminate the model's memorization of true labels, which leads to erroneous assessments of unlearning efficacy. This paper introduces conformal prediction to machine unlearning for the first time, proposing a falsifiable forgetting metric that tests whether the ground-truth label is excluded from the conformal prediction set. It further designs a falsifiable unlearning training paradigm that integrates the Carlini & Wagner adversarial loss. The approach jointly leverages membership inference attacks (MIA), uncertainty quantification, and fine-tuning for image classification. Evaluated on benchmarks including CIFAR-10, the metric reveals widespread insufficient forgetting in state-of-the-art methods. The proposed framework improves Unlearning Accuracy by 12.7% and reduces the MIA success rate to <5%, enhancing both the reliability and verifiability of unlearning outcomes.
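The idea of testing whether the ground-truth label falls outside a conformal prediction set can be illustrated with a minimal sketch. It assumes standard split conformal prediction with the nonconformity score 1 − softmax probability of the true class; the paper's exact metric is not given here, and the function names (`conformal_threshold`, `prediction_sets`, `truth_exclusion_rate`) are hypothetical.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal threshold from a held-out calibration set (assumed setup)."""
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels)
    n = len(cal_labels)
    # Nonconformity score: 1 - probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected quantile level, clipped to 1.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    # method="higher" requires NumPy >= 1.22.
    return np.quantile(scores, level, method="higher")

def prediction_sets(probs, q_hat):
    """Boolean mask: class k is in the set when 1 - p_k <= q_hat."""
    return (1.0 - np.asarray(probs, dtype=float)) <= q_hat

def truth_exclusion_rate(forget_probs, forget_labels, q_hat):
    """Fraction of forget-set samples whose true label is NOT in the set."""
    forget_labels = np.asarray(forget_labels)
    sets = prediction_sets(forget_probs, q_hat)
    covered = sets[np.arange(len(forget_labels)), forget_labels]
    return 1.0 - covered.mean()
```

Under this reading, a higher exclusion rate on the forget set indicates that the true label is no longer a plausible candidate for the unlearned model, which is a stricter condition than a wrong top-1 prediction.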
📝 Abstract
Machine unlearning seeks to systematically remove specified data from a trained model, effectively achieving a state as though the data had never been encountered during training. While metrics such as Unlearning Accuracy (UA) and Membership Inference Attack (MIA) provide a baseline for assessing unlearning performance, they fall short of evaluating the completeness and reliability of forgetting. This is because the ground-truth labels can remain candidates within the prediction set produced by uncertainty quantification, leaving gaps in the evaluation of true forgetting. In this paper, we identify critical limitations in existing unlearning metrics and propose enhanced evaluation metrics inspired by conformal prediction. Our metrics effectively capture the extent to which ground-truth labels are excluded from the prediction set. Furthermore, we observe that many existing machine unlearning methods do not achieve satisfactory forgetting performance when evaluated with our new metrics. To address this, we propose an unlearning framework that integrates conformal prediction insights into the Carlini & Wagner adversarial attack loss. Extensive experiments on the image classification task demonstrate that our enhanced metrics offer deeper insights into unlearning effectiveness, and that our unlearning framework significantly improves the forgetting quality of unlearning methods.
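For readers unfamiliar with the Carlini & Wagner loss, the sketch below shows its usual untargeted margin form on logits. It is only an illustration: how the framework combines this loss with conformal prediction is not described in the abstract, and the function name `cw_margin_loss` and the parameter `kappa` (the confidence margin) are assumptions made for this example.

```python
import numpy as np

def cw_margin_loss(logits, labels, kappa=0.0):
    """Untargeted Carlini & Wagner style margin, averaged over the batch.

    Per sample: max(z_true - max_{k != true} z_k, -kappa).
    Minimizing it pushes the true-class logit below the best other class
    by at least kappa.
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    true_logit = logits[idx, labels]
    # Mask out the true class before taking the max over the other classes.
    others = logits.copy()
    others[idx, labels] = -np.inf
    best_other = others.max(axis=1)
    return np.maximum(true_logit - best_other, -kappa).mean()
```

Applied to forget-set samples, driving this margin down lowers the true label's score relative to the other classes, which is the direction needed for the label to drop out of a score-thresholded prediction set.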