Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current LLM unlearning evaluation over-relies on token-level metrics (e.g., accuracy, perplexity), conflating "information hiding" with genuine erasure: minor fine-tuning often restores the original behavior, exposing a fundamental unreliability. Method: We propose the first representation-level unlearning evaluation framework, leveraging PCA similarity, centered kernel alignment (CKA), and Fisher information to rigorously distinguish reversible unlearning (token-level performance collapse despite preserved representations) from irreversible unlearning (deep representational degradation). We theoretically show that shallow weight perturbations can induce spurious forgetting signals. Contribution/Results: Empirically, we demonstrate that task type and hyperparameters critically modulate reversibility. Our framework is validated across six unlearning algorithms, three task domains (text, code, mathematics), and two open-source LLMs. We publicly release a unified analytical toolkit to support reproducible, representation-aware unlearning assessment.
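The CKA metric named in the summary can be illustrated with a minimal sketch. Linear CKA compares two activation matrices extracted from the same inputs before and after unlearning; a value near 1 means the representation geometry is largely preserved even when token-level outputs have collapsed, which is the signature of reversible forgetting. This is an illustrative implementation of standard linear CKA, not the paper's released toolkit:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two activation matrices.

    X, Y: (n_samples, d) hidden-state matrices from the same inputs,
    e.g. taken before and after unlearning. Returns a score in [0, 1];
    1 means the representation geometries are identical up to rotation
    and scaling.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # HSIC-based formulation for linear kernels
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den
```

By construction the score is invariant to orthogonal rotation and uniform rescaling of either representation, which is why it can expose a model whose features survived unlearning even though its output head was perturbed.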

📝 Abstract
Unlearning in large language models (LLMs) is intended to remove the influence of specific data, yet current evaluations rely heavily on token-level metrics such as accuracy and perplexity. We show that these metrics can be misleading: models often appear to forget, but their original behavior can be rapidly restored with minimal fine-tuning, revealing that unlearning may obscure information rather than erase it. To diagnose this phenomenon, we introduce a representation-level evaluation framework using PCA-based similarity and shift, centered kernel alignment, and Fisher information. Applying this toolkit across six unlearning methods, three domains (text, code, math), and two open-source LLMs, we uncover a critical distinction between reversible and irreversible forgetting. In reversible cases, models suffer token-level collapse yet retain latent features; in irreversible cases, deeper representational damage occurs. We further provide a theoretical account linking shallow weight perturbations near output layers to misleading unlearning signals, and show that reversibility is modulated by task type and hyperparameters. Our findings reveal a fundamental gap in current evaluation practices and establish a new diagnostic foundation for trustworthy unlearning in LLMs. We provide a unified toolkit for analyzing LLM representation changes under unlearning and relearning: https://github.com/XiaoyuXU1/Representational_Analysis_Tools.git.
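The Fisher-information probe mentioned in the abstract can be sketched in miniature. The diagonal empirical Fisher is the per-parameter mean squared gradient of the loss over examples: if unlearning leaves the Fisher mass of inner layers unchanged while token metrics collapse, the perturbation is likely shallow and reversible. The toy below computes it for a linear model with squared-error loss; it is a hypothetical NumPy sketch, whereas the paper works with full LLMs:

```python
import numpy as np

def diagonal_fisher_linear(X, y, w):
    """Diagonal empirical Fisher for a linear model with squared error.

    X: (n, d) inputs, y: (n,) targets, w: (d,) weights.
    Returns the (d,) mean squared per-example gradient -- a rough
    per-parameter sensitivity map of the kind compared before and
    after unlearning.
    """
    residuals = X @ w - y                              # (n,)
    per_example_grads = 2.0 * residuals[:, None] * X   # (n, d)
    return np.mean(per_example_grads ** 2, axis=0)     # (d,)
```

At a loss minimum the residuals vanish and so does the Fisher diagonal; away from it, parameters with large entries are the ones the data most strongly constrains.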
Problem

Research questions and friction points this paper is trying to address.

Evaluating machine unlearning effectiveness in LLMs
Distinguishing reversible vs irreversible forgetting in models
Addressing misleading token-level unlearning metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces PCA-based representation-level evaluation framework
Distinguishes reversible and irreversible forgetting in LLMs
Links shallow weight perturbations to unlearning signals
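The PCA-based comparison listed above can be sketched as an overlap between principal subspaces: extract the top-k PCA directions of hidden states before and after unlearning and measure how well the two subspaces align. A score near 1 means the principal directions survived unlearning (a hint of reversibility); a low score indicates deeper representational change. This is a hypothetical sketch of one standard subspace-overlap measure, not the paper's exact metric:

```python
import numpy as np

def pca_subspace_similarity(H_before, H_after, k=10):
    """Overlap of the top-k PCA subspaces of two hidden-state matrices.

    H_before, H_after: (n_samples, d) activations on the same inputs.
    Returns the mean squared cosine of the principal angles between
    the two k-dimensional subspaces, a value in [0, 1].
    """
    def top_k_components(H, k):
        H = H - H.mean(axis=0)
        # Right singular vectors are the principal directions
        _, _, Vt = np.linalg.svd(H, full_matrices=False)
        return Vt[:k].T  # (d, k) orthonormal basis

    U = top_k_components(H_before, k)
    V = top_k_components(H_after, k)
    # Singular values of U^T V are the cosines of the principal angles
    s = np.linalg.svd(U.T @ V, compute_uv=False)
    return float(np.mean(s ** 2))
```

Choosing k well below the hidden dimension focuses the comparison on the dominant feature directions rather than noise.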