🤖 AI Summary
This work exposes a critical failure of current large language model (LLM) unlearning methods under probabilistic decoding: although they perform well under greedy decoding evaluation, they consistently leak forgotten knowledge under realistic sampling-based generation. To address this, we introduce leak@$k$, the first meta-evaluation metric designed specifically for real-world decoding scenarios, which systematically quantifies the probability of forgotten knowledge reemergence. We conduct a large-scale empirical assessment across three major benchmarks—TOFU, MUSE, and WMDP—evaluating state-of-the-art unlearning methods under both greedy decoding and diverse stochastic sampling strategies (e.g., top-$k$, nucleus sampling). Results demonstrate that all existing methods exhibit significant knowledge leakage under sampling, revealing severe deficiencies in their unlearning robustness. This study is the first to challenge the validity of unlearning effectiveness from the perspective of decoding mechanisms, establishing a crucial evaluation paradigm and empirical foundation for developing truly reliable LLM unlearning techniques.
📝 Abstract
Unlearning in large language models (LLMs) is critical for regulatory compliance and for building ethical generative AI systems that avoid producing private, toxic, illegal, or copyrighted content. Despite rapid progress, in this work we show that *almost all* existing unlearning methods fail to achieve true forgetting in practice. Specifically, while evaluations of these 'unlearned' models under deterministic (greedy) decoding often suggest successful knowledge removal using standard benchmarks (as has been done in the literature), we show that sensitive information reliably resurfaces when models are sampled with standard probabilistic decoding. To rigorously capture this vulnerability, we introduce leak@$k$, a new meta-evaluation metric that quantifies the likelihood of forgotten knowledge reappearing when generating $k$ samples from the model under realistic decoding strategies. Using three widely adopted benchmarks, TOFU, MUSE, and WMDP, we conduct the first large-scale, systematic study of unlearning reliability using our newly defined leak@$k$ metric. Our findings demonstrate that knowledge leakage persists across methods and tasks, underscoring that current state-of-the-art unlearning techniques provide only limited forgetting and highlighting the urgent need for more robust approaches to LLM unlearning.
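The abstract defines leak@$k$ as the likelihood that forgotten knowledge reappears in at least one of $k$ stochastic generations. A minimal sketch of how such an estimate could be computed is shown below; the function names `generate` (a sampling-based decoder) and `leaks` (a per-generation leakage judge) are hypothetical placeholders, not the paper's implementation.

```python
def leak_at_k(prompts, generate, leaks, k=8):
    """Estimate leak@k: the fraction of prompts for which at least one
    of k sampled generations reveals supposedly forgotten knowledge.

    prompts  -- iterable of forget-set queries
    generate -- callable(prompt) -> str, drawing ONE stochastic sample
                (e.g., via nucleus or top-k sampling)
    leaks    -- callable(prompt, generation) -> bool, True if the
                generation exposes the forgotten information
    """
    leaked = 0
    for prompt in prompts:
        # A prompt counts as leaked if any of its k samples leaks.
        if any(leaks(prompt, generate(prompt)) for _ in range(k)):
            leaked += 1
    return leaked / len(prompts)
```

In practice, `generate` would wrap a model's sampling-based decoding and `leaks` would be a benchmark-specific check (e.g., exact-match or entailment against the forget-set answer); the sketch only illustrates the "any of $k$ samples" aggregation the metric's name implies.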