🤖 AI Summary
This work addresses a critical gap in current machine unlearning methods, which predominantly rely on output-layer metrics and therefore cannot discern whether sensitive information is genuinely erased or merely suppressed at the representation level. To resolve this ambiguity, the authors propose a restoration-based analysis framework that, for the first time, distinguishes between "suppression" and "deletion" of information in intermediate representations. The framework uses sparse autoencoders to identify class-specific features in hidden layers and applies feature-level interventions at inference time to quantitatively measure residual information. Evaluations across 12 state-of-the-art unlearning methods on image classification tasks reveal that most of them—even retraining from pretrained checkpoints—only suppress rather than delete semantic features, since the unlearned information remains highly restorable from intermediate representations. This finding exposes a significant blind spot in existing unlearning evaluation protocols and motivates a new, representation-level criterion for assessing privacy guarantees.
📝 Abstract
As pretrained models are increasingly shared on the web, ensuring that models can forget or delete sensitive, copyrighted, or private information upon request has become crucial. Machine unlearning has been proposed to address this challenge. However, current evaluations for unlearning methods rely on output-based metrics, which cannot verify whether information is completely deleted or merely suppressed at the representation level—and suppression is insufficient for true unlearning. To address this gap, we propose a novel restoration-based analysis framework that uses Sparse Autoencoders to identify class-specific expert features in intermediate layers and applies inference-time steering to quantitatively distinguish between suppression and deletion. Applying our framework to 12 major unlearning methods in image classification tasks, we find that most methods achieve high restoration rates of unlearned information, indicating that they only suppress information at the decision-boundary level while preserving semantic features in intermediate representations. Notably, even retraining from pretrained checkpoints shows high restoration, revealing that robust semantic features inherited from pretraining are not removed by retraining. These results demonstrate that representation-level retention poses significant risks overlooked by output-based metrics, highlighting the need for new unlearning evaluation criteria. We propose new evaluation guidelines that prioritize representation-level verification, especially for privacy-critical applications in the era of pretrained models.
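To make the restoration pipeline concrete, here is a minimal NumPy sketch of the core mechanics described above: encode hidden states with a sparse autoencoder, pick the SAE features most selective for the forgotten class ("class-specific expert features"), amplify them at inference time, and score how often steering flips predictions back to the forget class. All weights, shapes, the selection rule (mean activation difference), and the steering coefficient `alpha` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative, not from the paper).
d_hidden, d_sae, n_classes = 16, 64, 4

# Hypothetical pretrained SAE weights: hidden state h -> sparse features f -> reconstruction.
W_enc = rng.normal(scale=0.3, size=(d_hidden, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.3, size=(d_sae, d_hidden))


def sae_encode(h):
    """ReLU-sparse SAE feature activations for hidden states h of shape (n, d_hidden)."""
    return np.maximum(h @ W_enc + b_enc, 0.0)


def sae_decode(f):
    """Map SAE features back into the model's hidden space."""
    return f @ W_dec


def class_expert_features(h, labels, target, k=8):
    """Indices of the k SAE features most selectively active for class `target`.

    Selection rule (an assumption): largest gap between the mean activation
    on target-class samples and the mean activation on all other samples.
    """
    f = sae_encode(h)
    gap = f[labels == target].mean(axis=0) - f[labels != target].mean(axis=0)
    return np.argsort(gap)[-k:]


def steer(h, experts, alpha=5.0):
    """Inference-time intervention: amplify expert features, decode back to hidden space."""
    f = sae_encode(h)
    f[:, experts] *= alpha
    return sae_decode(f)


def restoration_rate(logits_before, logits_after, forget_label):
    """Fraction of samples the unlearned model suppressed (not predicted as the
    forget class) whose prediction returns to the forget class after steering.
    A high rate suggests the class features were suppressed, not deleted."""
    suppressed = logits_before.argmax(axis=1) != forget_label
    restored = logits_after.argmax(axis=1) == forget_label
    return float(np.mean(suppressed & restored))


# Demo on synthetic hidden states and a random linear readout head.
h = rng.normal(size=(100, d_hidden))
labels = rng.integers(0, n_classes, size=100)
W_head = rng.normal(size=(d_hidden, n_classes))

experts = class_expert_features(h, labels, target=0)
rate = restoration_rate(h @ W_head, steer(h, experts) @ W_head, forget_label=0)
print(f"restoration rate: {rate:.2f}")
```

On real data, `h` would be intermediate activations of the unlearned classifier and `W_head` its actual output head; the paper's finding is that this rate stays high for most unlearning methods, which is the signature of suppression rather than deletion.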