🤖 AI Summary
Addressing the fragmented landscape and ambiguous evaluation criteria in large language model (LLM) machine unlearning research, this paper proposes a holistic evaluation framework encompassing unlearning effectiveness, utility retention, and robustness. It establishes a taxonomy of twelve recent stateful unlearning methods, introduces open question-answering (Open-QA) metrics to overcome the insensitivity of traditional multiple-choice question (MCQ) benchmarks to degradation in generation quality, and conducts a fine-grained robustness analysis that exposes differential vulnerabilities under relearning and fine-tuning attacks. Using the WMDP benchmark, the authors systematically evaluate three representative method families: divergence-driven optimization, representation misalignment, and rejection-based targeted unlearning. Empirical results show that MCQ metrics overstate unlearning efficacy, whereas Open-QA more faithfully reflects the loss in generative performance; moreover, all method families exhibit a fundamental unlearning effectiveness–utility (UE–UT) trade-off.
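The summary distinguishes two model-level attacks on an unlearned model: in-domain relearning (fine-tuning on a small slice of the forgotten data) and out-of-domain fine-tuning (fine-tuning on unrelated benign text). The sketch below illustrates that distinction only in broad strokes; the attack loop, step count, learning rate, and the placeholder corpora (`forget_snippets`, `benign_corpus`) are illustrative assumptions, not the paper's exact protocol.

```python
# Hypothetical sketch of the two model-level robustness attacks discussed above.
# Hyperparameters and data names are assumptions, not the paper's setup.
import torch
from torch.utils.data import DataLoader


def finetune_attack(model, tok, texts, steps=50, lr=2e-5):
    """Briefly fine-tune an unlearned causal LM on `texts`, then return it
    so unlearning effectiveness (UE) can be re-measured afterwards."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(texts, batch_size=1, shuffle=True)
    it = iter(loader)
    model.train()
    for _ in range(steps):
        try:
            batch = next(it)
        except StopIteration:  # restart the loader when the corpus is exhausted
            it = iter(loader)
            batch = next(it)
        enc = tok(batch[0], return_tensors="pt", truncation=True, max_length=512)
        loss = model(**enc, labels=enc["input_ids"]).loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    return model.eval()


# relearned = finetune_attack(model, tok, forget_snippets)  # in-domain relearning
# shifted   = finetune_attack(model, tok, benign_corpus)    # out-of-domain fine-tuning
# Re-evaluating UE on both attacked models exposes the differential
# vulnerability the robustness analysis reports.
```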
📝 Abstract
Machine unlearning for large language models (LLMs) aims to remove undesired data, knowledge, and behaviors (e.g., for safety, privacy, or copyright) while preserving useful model capabilities. Despite rapid progress over the past two years, research in LLM unlearning remains fragmented, with limited clarity on what constitutes effective unlearning and how it should be rigorously evaluated. In this work, we present a principled taxonomy of twelve recent stateful unlearning methods, grouped into three methodological families: divergence-driven optimization, representation misalignment, and rejection-based targeted unlearning. Building on this taxonomy, we revisit the evaluation of unlearning effectiveness (UE), utility retention (UT), and robustness (Rob), focusing on the WMDP benchmark. Our analysis shows that current evaluations, dominated by multiple-choice question (MCQ) accuracy, offer only a narrow perspective, often overstating success while overlooking the model's actual generation behavior. To address this gap, we introduce open question-answering (Open-QA) metrics that better capture generative performance and reveal the inherent UE-UT tradeoff across method families. Furthermore, we demonstrate that robustness requires finer-grained analysis: for example, vulnerabilities differ substantially between in-domain relearning and out-of-domain fine-tuning, even though both fall under model-level attacks. Through this study, we hope to deliver a full-stack revisit of LLM unlearning and actionable guidance for designing and evaluating future methods.
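The abstract contrasts MCQ accuracy with Open-QA scoring but does not spell out an implementation. The sketch below shows one plausible way the two evaluation modes can diverge on an unlearned checkpoint; the model path, question fields, decoding settings, and the use of ROUGE-L as the Open-QA score are assumptions for illustration, not the paper's protocol.

```python
# Hypothetical sketch: why MCQ accuracy and Open-QA scores can disagree on an
# unlearned model. Checkpoint path, data fields, and the ROUGE-L-based
# Open-QA score are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from rouge_score import rouge_scorer

model_name = "path/to/unlearned-model"  # hypothetical unlearned checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()


def choice_loglik(question: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to a candidate answer."""
    prompt_ids = tok(question, return_tensors="pt").input_ids
    full_ids = tok(question + " " + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logp = torch.log_softmax(logits[0, :-1], dim=-1)
    # Score only the answer tokens (positions after the prompt).
    answer_positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(logp[i, full_ids[0, i + 1]].item() for i in answer_positions)


def mcq_correct(question: str, choices: list[str], answer_idx: int) -> bool:
    """MCQ-style metric: pick the highest-likelihood option (WMDP-style)."""
    scores = [choice_loglik(question, c) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__) == answer_idx


def open_qa_score(question: str, reference: str) -> float:
    """Open-QA metric (assumed here to be ROUGE-L against a reference answer)."""
    inputs = tok(question, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    pred = tok.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference, pred)["rougeL"].fmeasure


# An unlearned model can still rank the correct MCQ option highest (high MCQ
# accuracy) while its free-form generations are degenerate or refusals
# (near-zero Open-QA score) -- the gap the Open-QA metrics are meant to expose.
```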