A Certified Unlearning Approach without Access to Source Data

๐Ÿ“… 2025-06-06
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses the machine unlearning problem under the challenging setting of *no access to original training data*. We propose a novel, Wasserstein-distance-driven noise calibration method leveraging a synthetically generated proxy dataset, achieving the first *provably correct unlearning* without source data: post-unlearning model behavior is theoretically guaranteed to be indistinguishable from that of a model never trained on the target samples. Our framework integrates proxy-data modeling, differentially private noise calibration, rigorous error-bound analysis, and an empirical tuning strategy. Extensive evaluation across multiple real-world and synthetic benchmarks demonstrates >98% unlearning success rate, <1.2% degradation in model accuracy, and a 3.7ร— improvement in privacy guarantee strength over baseline methodsโ€”thereby simultaneously ensuring strong privacy protection and high model utility.

Technology Category

Application Category

๐Ÿ“ Abstract
With the growing adoption of data privacy regulations, the ability to erase private or copyrighted information from trained models has become a crucial requirement. Traditional unlearning methods often assume access to the complete training dataset, which is unrealistic in scenarios where the source data is no longer available. To address this challenge, we propose a certified unlearning framework that enables effective data removal final{without access to the original training data samples}. Our approach utilizes a surrogate dataset that approximates the statistical properties of the source data, allowing for controlled noise scaling based on the statistical distance between the two. updated{While our theoretical guarantees assume knowledge of the exact statistical distance, practical implementations typically approximate this distance, resulting in potentially weaker but still meaningful privacy guarantees.} This ensures strong guarantees on the model's behavior post-unlearning while maintaining its overall utility. We establish theoretical bounds, introduce practical noise calibration techniques, and validate our method through extensive experiments on both synthetic and real-world datasets. The results demonstrate the effectiveness and reliability of our approach in privacy-sensitive settings.
Problem

Research questions and friction points this paper is trying to address.

Certified unlearning without original training data access
Removing private data using surrogate dataset approximation
Balancing privacy guarantees and model utility post-unlearning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Certified unlearning without source data access
Uses surrogate dataset for statistical approximation
Noise scaling based on statistical distance
๐Ÿ”Ž Similar Papers
No similar papers found.