Position: LLM Unlearning Benchmarks are Weak Measures of Progress

📅 2024-10-03
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
Existing LLM unlearning benchmarks give an overly optimistic, potentially misleading view of how effective candidate unlearning methods are. By applying simple, benign modifications to several popular benchmarks, the authors show that supposedly unlearned information often remains accessible, and that unlearning can degrade performance on retained information far more than the original benchmark scores indicate. Benchmarks prove especially fragile when even loose dependencies exist between forget and retain information, and ambiguity in unlearning targets invites methods that overfit to the given test queries. The paper argues that current benchmark results should not be read as reliable measures of progress and offers recommendations to guide future LLM unlearning research.

📝 Abstract
Unlearning methods have the potential to improve the privacy and safety of large language models (LLMs) by removing sensitive or harmful information post hoc. The LLM unlearning research community has increasingly turned toward empirical benchmarks to assess the effectiveness of such methods. In this paper, we find that existing benchmarks provide an overly optimistic and potentially misleading view on the effectiveness of candidate unlearning methods. By introducing simple, benign modifications to a number of popular benchmarks, we expose instances where supposedly unlearned information remains accessible, or where the unlearning process has degraded the model's performance on retained information to a much greater extent than indicated by the original benchmark. We identify that existing benchmarks are particularly vulnerable to modifications that introduce even loose dependencies between the forget and retain information. Further, we show that ambiguity in unlearning targets in existing benchmarks can easily lead to the design of methods that overfit to the given test queries. Based on our findings, we urge the community to be cautious when interpreting benchmark results as reliable measures of progress, and we provide several recommendations to guide future LLM unlearning research.
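The query-overfitting failure the abstract describes can be illustrated with a toy sketch. This is hypothetical code, not the paper's methodology: the stand-in "model" is a lookup table, and all queries and facts are invented. It shows how an unlearning method that suppresses only the benchmark's exact test phrasing passes the benchmark while leaking the fact under a benign rephrasing.

```python
def toy_unlearned_model(query: str) -> str:
    """Stand-in for an LLM whose 'unlearning' suppresses one exact query string."""
    knowledge = {
        "who wrote book x": "Alice",               # fact the method was meant to forget
        "the author of book x is": "Alice",        # same fact, different surface form
        "what is the capital of france": "Paris",  # unrelated retained fact
    }
    # Overfit unlearning: refuse only the benchmark's verbatim forget query.
    if query == "who wrote book x":
        return "I can't answer that."
    return knowledge.get(query, "I don't know.")

# The original benchmark query looks successfully unlearned...
print(toy_unlearned_model("who wrote book x"))         # -> I can't answer that.
# ...but a benign rephrasing recovers the "forgotten" information.
print(toy_unlearned_model("the author of book x is"))  # -> Alice
```

A benchmark that only ever issues the first query would report this model as a successful unlearning result.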
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks overestimate LLM unlearning effectiveness
Benchmarks break down when forget and retain information are even loosely dependent
Ambiguous unlearning targets invite overfitting to test queries
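The dependency friction point above can also be sketched with a hypothetical toy (not from the paper): when a retain query implicitly references an entity from the forget set, aggressive unlearning silently breaks retention, even though independent retain queries still pass and keep the reported retention score high.

```python
def toy_answer(knows_alice: bool, query: str) -> str:
    """Stand-in model; forgetting the entity 'Alice' also breaks dependent queries."""
    q = query.lower().rstrip("?")
    if "alice" in q and not knows_alice:
        return "Unknown entity."
    responses = {
        "which novel did alice write": "Book X",   # retain query, depends on Alice
        "what is the capital of france": "Paris",  # independent retain query
    }
    return responses.get(q, "I don't know.")

# Before unlearning, the dependent retain query works:
print(toy_answer(True, "Which novel did Alice write?"))     # -> Book X
# After forgetting Alice, the independent retain query still passes...
print(toy_answer(False, "What is the capital of France?"))  # -> Paris
# ...but the loosely dependent one fails, so retention measured only on
# independent queries overstates the model's remaining utility.
print(toy_answer(False, "Which novel did Alice write?"))    # -> Unknown entity.
```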
Innovation

Methods, ideas, or system contributions that make the work stand out.

Exposing flaws in LLM unlearning benchmarks
Identifying vulnerabilities in forget-retain dependencies
Recommending cautious interpretation of benchmark results