🤖 AI Summary
Existing LLM unlearning evaluation metrics suffer from limited practicality, accuracy, and robustness. To address this, we propose DCUE, a distribution-calibration-based unlearning evaluation framework. DCUE identifies semantically critical tokens in model outputs, corrects confidence distribution biases via unsupervised calibration on a validation set—without requiring human annotations or strong modeling assumptions—and quantifies pre- and post-unlearning distributional shifts using the Kolmogorov–Smirnov test. Our method explicitly accounts for semantic token importance, unlike conventional metrics such as accuracy drop or KL divergence, which are noise-sensitive and agnostic to semantics. Experiments demonstrate that DCUE significantly improves evaluation sensitivity and reliability, effectively distinguishing the efficacy of diverse unlearning algorithms. By providing an interpretable, reproducible, and assumption-light benchmark, DCUE advances trustworthy LLM unlearning assessment.
📝 Abstract
This paper analyzes the limitations of existing unlearning evaluation metrics in terms of practicality, accuracy, and robustness in real-world LLM unlearning scenarios. To overcome these limitations, we propose a new metric, Distribution Correction-based Unlearning Evaluation (DCUE). It identifies core tokens and corrects distributional biases in their confidence scores using a validation set, then quantifies pre- and post-unlearning shifts with the Kolmogorov–Smirnov test. Experimental results demonstrate that DCUE overcomes the limitations of existing metrics and offers guidance for designing more practical and reliable unlearning algorithms.
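The pipeline described above can be sketched roughly as follows: calibrate core-token confidence scores against a validation set, then compare the pre- and post-unlearning distributions with a two-sample Kolmogorov–Smirnov test. This is a minimal illustrative sketch, not the paper's exact procedure; the z-score calibration step, the function names, and the synthetic confidence data are all assumptions for demonstration.

```python
import numpy as np
from scipy.stats import ks_2samp

def calibrate(conf, val_conf):
    """Illustrative bias correction: z-score confidences against the
    validation-set distribution (the paper's exact calibration may differ)."""
    return (conf - val_conf.mean()) / (val_conf.std() + 1e-8)

def dcue_score(pre_conf, post_conf, val_conf):
    """KS statistic between calibrated pre- and post-unlearning
    core-token confidence distributions; larger = stronger shift."""
    return ks_2samp(calibrate(pre_conf, val_conf),
                    calibrate(post_conf, val_conf))

# Synthetic core-token confidences for demonstration only.
rng = np.random.default_rng(0)
val = rng.normal(0.8, 0.05, 500)    # validation-set confidences
pre = rng.normal(0.8, 0.05, 500)    # before unlearning: high confidence
post = rng.normal(0.4, 0.10, 500)   # after unlearning: degraded confidence

stat, p = dcue_score(pre, post, val)
print(stat, p)  # KS statistic in [0, 1]; near 1 when distributions barely overlap
```

Because the KS test is distribution-free, this comparison needs no parametric assumptions about the confidence scores, matching the framework's "assumption-light" design.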