🤖 AI Summary
This study investigates how knowledge encoding strategies during model training affect post-hoc controllable unlearning of factual knowledge in large language models (LLMs), addressing critical needs such as privacy compliance and correction of outdated or harmful content. We propose and empirically validate *paraphrased learning*, a method that injects target knowledge via semantically equivalent but lexically diverse formulations during training, which significantly improves unlearning efficacy. Experiments across multiple LLMs demonstrate a 23–37% absolute gain in unlearning success rate compared to baseline methods; in contrast, coarse-grained text-block injection severely degrades unlearning precision. We establish a rigorously controlled experimental framework integrating standard unlearning benchmarks, knowledge injection protocols, and targeted evaluation metrics. To our knowledge, this is the first work to identify training-time encoding design as a fundamental prerequisite for reliable unlearning, thereby providing a concrete, actionable optimization pathway at the training stage for controllable knowledge deletion.
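At its core, paraphrased learning is a data-augmentation step: each target fact is rendered in several lexically distinct but semantically equivalent forms before fine-tuning. A minimal sketch of this idea is below; the templates, fact, and function name are purely illustrative assumptions, not artifacts from the study itself.

```python
# Sketch of paraphrase-based knowledge injection (illustrative only).
# In practice a paraphrase model would generate the variants; here a few
# hand-written templates stand in for it.

def build_paraphrased_examples(subject: str, obj: str, templates: list[str]) -> list[str]:
    """Render one (subject, object) fact as several lexically diverse training strings."""
    return [t.format(s=subject, o=obj) for t in templates]

# Hypothetical paraphrase templates for a "birthplace" fact.
TEMPLATES = [
    "{s} was born in {o}.",
    "The birthplace of {s} is {o}.",
    "{o} is where {s} was born.",
]

examples = build_paraphrased_examples("Ada Lovelace", "London", TEMPLATES)
for e in examples:
    print(e)
```

Each rendered string would then be added to the fine-tuning corpus as an independent training example, so the fact is encoded across varied surface forms rather than tied to a single phrasing.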
📄 Abstract
As large language models (LLMs) are increasingly deployed in the real world, the ability to "unlearn", or remove specific pieces of knowledge post hoc, has become essential for reasons ranging from privacy regulations to correcting outdated or harmful content. Prior work has proposed unlearning benchmarks and algorithms, but has typically assumed that the training process and the target model are fixed. In this work, we empirically investigate how learning-time choices in knowledge encoding impact the effectiveness of unlearning factual knowledge. Our experiments reveal two key findings: (1) learning with paraphrased descriptions improves unlearning performance, and (2) unlearning an individual piece of knowledge from a chunk of text is challenging. Our results suggest that learning-time knowledge encoding may play a central role in enabling reliable post-hoc unlearning.