🤖 AI Summary
This work proposes a statistical unlearning framework under a general loss function to efficiently implement machine unlearning, thereby complying with data-deletion regulations and mitigating the impact of contaminated data. Focusing on squared loss, the authors develop Unlearning Least Squares (ULS), which accurately estimates the optimal parameters for the remaining data using only a pre-trained model, the samples to be forgotten, and a small amount of retained data. Theoretical analysis establishes the minimax optimality of ULS, showing that its estimation error decomposes into an oracle term and an "unlearning cost" determined by the fraction of data removed and the bias of the forgotten model. The framework further supports asymptotically valid inference without full retraining. Empirical results demonstrate that ULS nearly matches full retraining while requiring substantially less data access.
📝 Abstract
There is a growing demand for efficient data removal to comply with regulations like the GDPR and to mitigate the influence of biased or corrupted data. This has motivated the field of machine unlearning, which aims to eliminate the influence of specific data subsets without the cost of full retraining. In this work, we propose a statistical framework for machine unlearning with generic loss functions and establish theoretical guarantees. For squared loss in particular, we develop Unlearning Least Squares (ULS) and establish its minimax optimality for estimating the model parameter of the remaining data when only the pre-trained estimator, the forget samples, and a small subsample of the remaining data are available. Our results reveal that the estimation error decomposes into an oracle term and an unlearning cost determined by the forget proportion and the forget model bias. We further establish asymptotically valid inference procedures without requiring full retraining. Numerical experiments and real-data applications demonstrate that the proposed method achieves performance close to retraining while requiring substantially less data access.
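To make the setting concrete, the sketch below shows the classical exact "downdate" for least squares: subtracting the forget samples' contribution from the normal-equation sufficient statistics recovers the retrained estimator without touching the retained rows. This is only an illustration of the idea that unlearning methods such as ULS approximate; it is not the authors' ULS estimator, which works from the fitted model, the forget samples, and a small retained subsample rather than stored Gram matrices. All variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n samples, d features, linear model with small noise.
n, d = 1000, 5
X = rng.normal(size=(n, d))
beta_true = np.arange(1, d + 1, dtype=float)
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Pre-trained least-squares fit on the full data, with its
# sufficient statistics (Gram matrix and cross-moment vector).
G_full = X.T @ X
b_full = X.T @ y
beta_full = np.linalg.solve(G_full, b_full)

# Forget set: suppose the first m samples must be removed.
m = 100
X_f, y_f = X[:m], y[:m]

# Exact unlearning: downdate the sufficient statistics using only
# the forget samples, never revisiting the retained rows.
G_retain = G_full - X_f.T @ X_f
b_retain = b_full - X_f.T @ y_f
beta_unlearn = np.linalg.solve(G_retain, b_retain)

# Oracle baseline: retrain from scratch on the retained data.
beta_retrain = np.linalg.lstsq(X[m:], y[m:], rcond=None)[0]

print(np.allclose(beta_unlearn, beta_retrain))  # True: downdate is exact
```

For squared loss this downdate is exact, which is what makes least squares a natural testbed; the statistical question the paper studies is how much precision is lost when the full sufficient statistics are unavailable and only the pre-trained estimator plus a small retained subsample can be used.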