🤖 AI Summary
This work addresses a critical gap in existing machine unlearning methods, which overlook how factual salience and the origin of knowledge (acquired during pretraining or during fine-tuning) affect unlearning efficacy. To bridge this gap, the authors introduce DUAL, a benchmark of 28.6k Wikidata triples annotated for factual salience using Wikipedia link counts and large language model–based scoring. Building on this benchmark, they establish the first fine-grained unlearning evaluation framework that explicitly accounts for factual salience. Comprehensive comparative experiments and stability analyses reveal fundamental differences in how knowledge from pretraining versus fine-tuning responds to unlearning: models fine-tuned before unlearning exhibit smoother, more stable forgetting and retain 10–50% more relevant knowledge, whereas directly unlearning pretrained models often leads to catastrophic forgetting or inadvertent relearning.
📝 Abstract
Machine Unlearning (MU) enables Large Language Models (LLMs) to remove unsafe or outdated information. However, existing work assumes that all facts are equally forgettable and largely ignores whether the forgotten knowledge originates from pretraining or supervised fine-tuning (SFT). In this paper, we introduce DUAL (Dual Unlearning Evaluation across Training Stages), a benchmark of 28.6k Wikidata-derived triples annotated with fact popularity using Wikipedia link counts and LLM-based salience scores. Our experiments show that pretrained and SFT models respond differently to unlearning: an SFT step on the forget data yields smoother forgetting, more stable tuning, and 10–50% higher retention, while direct unlearning on pretrained models remains unstable and prone to relearning or catastrophic forgetting.
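The popularity annotation combines two signals per triple: a Wikipedia link count and an LLM-based salience score. The sketch below is a hypothetical illustration of how such signals might be fused into a single score; the function name, log-scaling, inlink cap, and `alpha` weighting are assumptions for illustration, not the paper's actual pipeline.

```python
import math

def salience_score(wiki_link_count: int, llm_score: float, alpha: float = 0.5) -> float:
    """Fuse a Wikipedia inlink count with an LLM salience score
    (assumed to lie in [0, 1]) into one popularity value in [0, 1].

    Hypothetical sketch: log-scale the raw link count so a few
    hugely-linked pages do not dominate, squash with an assumed
    cap of ~1e6 inlinks, then take a weighted average.
    """
    link_component = min(math.log1p(wiki_link_count) / math.log1p(1e6), 1.0)
    return alpha * link_component + (1.0 - alpha) * llm_score

# A heavily linked entity with high LLM salience scores near 1.0;
# an obscure one scores near 0.
popular = salience_score(wiki_link_count=250_000, llm_score=0.9)
obscure = salience_score(wiki_link_count=3, llm_score=0.1)
```

Any monotone fusion of the two signals would serve the same purpose here: ranking facts from obscure to salient so that unlearning efficacy can be evaluated across popularity strata.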