🤖 AI Summary
To address the risk of large language models (LLMs) memorizing sensitive, copyrighted, or harmful content during pretraining, this paper proposes a robust machine unlearning framework. Methodologically, it introduces a novel tripartite loss function integrating masked learning, knowledge distillation, and world-knowledge consistency constraints, coupled with token-level target identification, preservation-set construction, and efficient LoRA-based fine-tuning for targeted data removal. Key contributions include: (i) the first incorporation of world-fact alignment into unlearning objectives, significantly improving both forgetting quality and model fidelity; (ii) empirical gains across benchmarks (e.g., Harry Potter, WMDP, and TOFU): a 42% reduction in memorization rate, 98.3% utility retention, and strong resilience against membership inference attacks; and (iii) a new document-level evaluation paradigm for unlearning.
📄 Abstract
Large language models (LLMs) trained over extensive corpora risk memorizing sensitive, copyrighted, or toxic content. To address this, we propose OBLIVIATE, a robust unlearning framework that removes targeted data while preserving model utility. The framework follows a structured process: extracting target tokens, building retain sets, and fine-tuning with a tailored loss function comprising three components -- masking, distillation, and world fact. Using low-rank adapters (LoRA), it ensures efficiency without compromising unlearning quality. We conduct experiments on multiple datasets, including the Harry Potter series, WMDP, and TOFU, using a comprehensive suite of metrics: forget quality (via a new document-level memorization score), model utility, and fluency. Results demonstrate OBLIVIATE's effectiveness in resisting membership inference attacks, minimizing the impact on retained data, and maintaining robustness across diverse scenarios.
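The three-component loss described above can be illustrated with a minimal per-token sketch. This is an assumption-laden toy, not the paper's actual objective: the component names, weights (`ALPHA`, `BETA`, `GAMMA`), and the specific forms of the masking, distillation, and world-fact terms are all illustrative stand-ins chosen to convey the structure (penalize forget-token predictions, match the original model on retained tokens, stay consistent with reference world knowledge).

```python
import math

# Hypothetical component weights (names and values assumed, not from the paper).
ALPHA, BETA, GAMMA = 1.0, 1.0, 0.5

def kl_divergence(p, q):
    """KL(p || q) between two token probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tripartite_loss(student_probs, teacher_probs, world_probs,
                    target_idx, is_forget_token):
    """Toy version of a masking + distillation + world-fact loss.

    student_probs: unlearned model's next-token distribution
    teacher_probs: original (pre-unlearning) model's distribution
    world_probs:   reference distribution encoding world knowledge
    target_idx:    index of the ground-truth next token
    is_forget_token: whether this token lies in the forget set
    """
    # Masking term: on forget tokens, penalize probability mass placed
    # on the memorized target token (illustrative stand-in).
    mask_term = -math.log(1.0 - student_probs[target_idx]) if is_forget_token else 0.0
    # Distillation term: on retained tokens, match the teacher model.
    distill_term = 0.0 if is_forget_token else kl_divergence(teacher_probs, student_probs)
    # World-fact term: stay close to reference world knowledge everywhere.
    world_term = kl_divergence(world_probs, student_probs)
    return ALPHA * mask_term + BETA * distill_term + GAMMA * world_term
```

On retained tokens where the student already matches the teacher and world reference, the loss is zero; on forget tokens it grows as the student keeps assigning probability to the memorized continuation, which is the intended pressure in the three-part design.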