🤖 AI Summary
Machine unlearning aims to eliminate the influence of specific data on a trained model; the gold standard, retrain equivalence, requires the unlearned model to be statistically indistinguishable from one retrained from scratch without the target data.
Method: This work develops a path-dependence theory of machine unlearning under multi-stage training, pairing theoretical analysis with empirical evaluation of gradient ascent, NPO, and SimNPO as post-training unlearning methods on Llama and Qwen models (1B to 14B), using GSM8K as the testbed.
Contribution/Results: The authors prove that local unlearning algorithms (those using only gradients on the forget set) cannot universally achieve retrain equivalence in multi-stage settings such as LLM fine-tuning, because the outcome of unlearning is inherently sensitive to the order of training stages. Experiments confirm this: the degradation in GSM8K accuracy after unlearning varies by over 20% across training orders, some learning paths consistently yield slow-to-unlearn models, and whether probability mass shifts toward paraphrases or alternative concepts is likewise path-dependent. The work delineates a fundamental limit of local unlearning and motivates rethinking its definition when training histories are unavailable.
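The local unlearning methods evaluated here (gradient ascent, NPO, SimNPO) share one defining trait: they update the model using gradients computed on the forget set alone, with no access to the retain set or the training history. A minimal sketch of the gradient-ascent variant on a toy least-squares model; the data, dimensions, and step sizes are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_mse(w, X, y):
    """Gradient of the mean squared error 0.5 * mean((Xw - y)^2) w.r.t. w."""
    return X.T @ (X @ w - y) / len(y)

# Toy retain and forget sets drawn from two different linear "tasks"
# (illustrative stand-ins for retained vs. to-be-forgotten data).
X_retain = rng.normal(size=(64, 4)); y_retain = X_retain @ np.ones(4)
X_forget = rng.normal(size=(16, 4)); y_forget = X_forget @ (-np.ones(4))

# Learn on the union of both sets.
X_all = np.vstack([X_retain, X_forget])
y_all = np.concatenate([y_retain, y_forget])
w = np.zeros(4)
for _ in range(200):
    w -= 0.1 * grad_mse(w, X_all, y_all)

loss_before = 0.5 * np.mean((X_forget @ w - y_forget) ** 2)

# "Local" unlearning: gradient *ascent* on the forget-set loss only --
# no retain-set gradients, no knowledge of how the model was trained.
for _ in range(20):
    w += 0.05 * grad_mse(w, X_forget, y_forget)

loss_after = 0.5 * np.mean((X_forget @ w - y_forget) ** 2)
print(f"forget-set loss: {loss_before:.3f} -> {loss_after:.3f}")
```

NPO and SimNPO replace the raw ascent objective with preference-style losses, but remain local in the same sense: every gradient they use is computed on the forget set.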
📝 Abstract
Machine unlearning seeks to selectively remove the "influence" of specific training data on a model's outputs. The ideal goal is Retrain Equivalence--behavior identical to a model trained from scratch on only the retained data. This goal was formulated for models trained on i.i.d. data batches, but modern pipelines often involve multi-stage training, with each stage having a distinct data distribution and objective. Examples include LLM fine-tuning for alignment, reasoning ability, etc. Our study shows via theory and experiments that this shift to multi-stage training introduces a fundamental barrier for machine unlearning. The theory indicates that the outcome of local unlearning--methods that only use gradients computed on the forget set--is path-dependent. That is, a model's behavior during unlearning is influenced by the order of its training stages during learning, making it impossible for path-oblivious algorithms to universally achieve Retrain Equivalence. We empirically demonstrate the same phenomenon in LLM post-training across Llama and Qwen models (1B to 14B) with gradient ascent, NPO, and SimNPO local unlearning algorithms. Models fine-tuned via different orderings of identical training stages diverge in behavior during unlearning, with the degradation in GSM8K accuracy after unlearning varying by over 20% across paths. We also observe that some learning paths consistently produce models that unlearn slowly. During unlearning, whether the probability mass gets squeezed into paraphrasing or alternative concepts is also path-dependent. These results consistently show that Retrain Equivalence is an ill-posed target for local unlearning algorithms, so long as the target models are trained in stages. In situations where access to models' training histories is hard, the current work calls for rethinking the definition and desiderata of machine unlearning.
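The path-dependence claim can be illustrated in miniature: finite-step training on distinct stages is non-commutative, so two models that see identical stages in different orders arrive at different weights, and an identical path-oblivious unlearning procedure then lands them at different final models. Everything below (stage data, step counts, learning rates) is an illustrative assumption standing in for the paper's LLM experiments:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

def grad_mse(w, X, y):
    """Gradient of 0.5 * mean((Xw - y)^2) w.r.t. w."""
    return X.T @ (X @ w - y) / len(y)

def stage(w, X, y, steps=30, lr=0.1):
    """One fine-tuning stage: a few gradient steps, not run to convergence."""
    for _ in range(steps):
        w = w - lr * grad_mse(w, X, y)
    return w

def unlearn(w, X, y, steps=10, lr=0.05):
    """Path-oblivious local unlearning: gradient ascent on the forget set."""
    for _ in range(steps):
        w = w + lr * grad_mse(w, X, y)
    return w

# Two stages with different data distributions and targets
# (toy stand-ins for, e.g., alignment vs. reasoning fine-tuning).
X_a = rng.normal(size=(32, d)); y_a = X_a @ np.ones(d)
X_b = rng.normal(size=(32, d)); y_b = X_b @ np.arange(1.0, d + 1)
# Shared forget set, identical for both paths.
X_f = rng.normal(size=(8, d)); y_f = X_f @ (-np.ones(d))

w0 = np.zeros(d)
w_ab = stage(stage(w0, X_a, y_a), X_b, y_b)   # learning path A -> B
w_ba = stage(stage(w0, X_b, y_b), X_a, y_a)   # learning path B -> A

# The *same* unlearning procedure applied to both paths.
u_ab = unlearn(w_ab, X_f, y_f)
u_ba = unlearn(w_ba, X_f, y_f)

print(f"post-unlearning weight gap across paths: {np.linalg.norm(u_ab - u_ba):.3f}")
```

Because the two paths already differ before unlearning and the unlearning step sees only the forget set, no amount of local gradient information can steer both paths to the same retrained-from-scratch target, which is the intuition behind the impossibility result.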