🤖 AI Summary
Existing fact-unlearning methods for large language models suffer from a fundamental limitation: even after a target fact is removed, it can often be rederived via multi-step reasoning over retained knowledge and logical inference. Method: We propose "deep unlearning", a new setting in which the goal is not only to remove the target fact but also to prevent its reconstruction through multi-hop deduction. We introduce Eval-DU, a novel semi-synthetic benchmark that supports multiple steps of realistic deduction among synthetic facts, and complement it with one-step deduction instances from the real-world MQuAKE dataset. We define three quantitative metrics: Success-DU and Recall to measure unlearning efficacy, and Accuracy to measure the utility of the remaining model. Results: Extensive experiments reveal that state-of-the-art methods consistently fall short, either failing to achieve deep unlearning or excessively removing unrelated knowledge, validating the need for dedicated algorithms. This work establishes a new benchmark and evaluation protocol for robust fact unlearning.
📝 Abstract
Machine unlearning has emerged as an important component in developing safe and trustworthy models. Prior work on fact unlearning in LLMs has mostly focused on robustly removing a specified target fact, but often overlooks its deductive connections to other knowledge. We propose a new setting for fact unlearning, deep unlearning, where the goal is not only to remove a target fact but also to prevent it from being deduced from retained knowledge in the LLM via logical reasoning. We propose three novel metrics: Success-DU and Recall to measure unlearning efficacy, and Accuracy to measure the utility of the remaining model. To benchmark this setting, we (1) leverage an existing real-world knowledge dataset, MQuAKE, which provides one-step deduction instances, and (2) construct a novel semi-synthetic dataset, Eval-DU, which allows multiple steps of realistic deduction among synthetic facts. Experiments reveal that current methods struggle with deep unlearning: they either fail to deeply unlearn the target fact, or excessively remove unrelated facts. Our results suggest that targeted algorithms may have to be developed for robust, deep fact unlearning in LLMs.
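The core failure mode described above can be sketched in code. The following is a minimal, illustrative forward-chaining check, not the paper's evaluation procedure: the fact names and rules are hypothetical, and it only shows how a "shallowly" unlearned target fact can remain deducible from retained facts.

```python
def deducible(target, facts, rules):
    """Return True if `target` can be derived from `facts` via `rules`.

    `rules` is a list of (premises, conclusion) pairs, where `premises`
    is a set of facts that jointly imply `conclusion`.
    """
    known = set(facts)
    changed = True
    while changed:  # forward-chain until no new fact is derived
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                changed = True
    return target in known

# Hypothetical example: the target fact "born_in_country(A, UK)" was
# removed, yet two retained facts still let the model re-derive it,
# so shallow unlearning fails the deep-unlearning criterion.
retained = {"born_in(A, London)", "capital_of(London, UK)"}
rules = [({"born_in(A, London)", "capital_of(London, UK)"},
          "born_in_country(A, UK)")]
print(deducible("born_in_country(A, UK)", retained, rules))  # True
```

Deep unlearning, in this toy view, requires removing enough of the supporting facts that no deduction path to the target remains, while Recall/Accuracy-style metrics penalize removing more than necessary.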