🤖 AI Summary
This study systematically investigates the decay of knowledge editing (KE) in large language models (LLMs) after fine-tuning, a question with direct consequences for safety and practical deployment: do edits persist, or do they vanish once the model is fine-tuned? Using two state-of-the-art KE methods (MEMIT and AlphaEdit) and three fine-tuning paradigms (full-parameter fine-tuning, LoRA, and DoRA), the authors evaluate 232 experimental configurations spanning five LLMs and three benchmark datasets. The findings reveal that (1) edit decay is significant and pervasive, with AlphaEdit edits more vulnerable than MEMIT edits; (2) fine-tuning only the non-edited layers induces stronger decay than full-parameter fine-tuning; and (3) a novel selective-layer fine-tuning strategy enables controllable, targeted removal of edits. The work provides the first systematic, quantitative evidence on KE robustness under fine-tuning and delivers practical techniques for evaluating edit persistence, erasing malicious edits, and conducting secure, edit-aware model adaptation.
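To make the notion of edit persistence concrete, here is a minimal sketch of how one might measure the fraction of edited facts a model still recalls before and after fine-tuning. It is not the paper's evaluation protocol: the `edits` list of (prompt, target) pairs, the greedy-decoding setup, and the substring-match criterion are illustrative assumptions.

```python
# Minimal sketch (not the paper's protocol): estimate edit survival by checking
# whether the model's greedy continuation still contains the edited target
# answer; comparing the rate before vs. after fine-tuning gives edit decay.
import torch


def edit_success_rate(model, tokenizer, edits, max_new_tokens=10):
    """edits: list of (prompt, target) pairs produced by the KE method."""
    model.eval()
    hits = 0
    for prompt, target in edits:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                 do_sample=False)
        # Decode only the newly generated tokens, not the prompt.
        continuation = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                        skip_special_tokens=True)
        hits += int(target.strip().lower() in continuation.lower())
    return hits / len(edits)


# Edit decay = success rate on the edited model minus success rate after
# fine-tuning that edited model, e.g.:
# decay = edit_success_rate(edited_model, tok, edits) \
#         - edit_success_rate(finetuned_model, tok, edits)
```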
📝 Abstract
Knowledge editing (KE) has emerged as a lightweight alternative to retraining for correcting or injecting specific facts in large language models (LLMs). Meanwhile, fine-tuning remains the default operation for adapting LLMs to new domains and tasks. Despite their widespread adoption, these two post-training interventions have been studied in isolation, leaving open a crucial question: if we fine-tune an edited model, do the edits survive? This question is motivated by two practical scenarios: removing covert or malicious edits, and preserving beneficial edits. If fine-tuning impairs edits, as shown in Figure 1, current KE methods become less useful, since every fine-tuned model would require re-editing, significantly increasing cost; if edits persist, fine-tuned models risk propagating hidden malicious edits, raising serious safety concerns. To this end, we systematically quantify edit decay after fine-tuning, investigating how fine-tuning affects previously edited knowledge. We evaluate two state-of-the-art editing methods (MEMIT, AlphaEdit) and three fine-tuning approaches (full-parameter, LoRA, DoRA) across five LLMs and three datasets, yielding 232 experimental configurations. Our results show that edits decay after fine-tuning, with survival varying across configurations; for example, AlphaEdit edits decay more than MEMIT edits. Further, we propose selective-layer fine-tuning and find that fine-tuning only the edited layers can effectively remove edits, though at a slight cost to downstream performance. Surprisingly, fine-tuning non-edited layers impairs more edits than full fine-tuning. Overall, our study establishes empirical baselines and actionable strategies for integrating knowledge editing with fine-tuning, and underscores that evaluating model editing requires considering the full LLM application pipeline.
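The selective-layer idea can be illustrated with a short sketch, assuming a PyTorch / Hugging Face-style decoder whose parameter names contain the transformer block index (e.g., `model.layers.4.mlp...`); the name pattern and the edited-layer indices below are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of selective-layer fine-tuning: freeze everything except the
# chosen layers (e.g., those modified by MEMIT/AlphaEdit), so a standard
# fine-tuning loop only updates those layers.
import re


def unfreeze_only_layers(model, layer_indices):
    """Enable gradients only for parameters belonging to `layer_indices`.

    Assumes parameter names embed the block index as '.<idx>.', which holds
    for most Hugging Face decoder implementations.
    """
    pattern = re.compile(r"\.(\d+)\.")
    for name, param in model.named_parameters():
        match = pattern.search(name)
        in_target = match is not None and int(match.group(1)) in layer_indices
        param.requires_grad = in_target


# Example (indices are illustrative): fine-tune only the edited layers to
# erase edits, or pass the complementary index set to fine-tune only the
# non-edited layers.
# unfreeze_only_layers(model, layer_indices={4, 5, 6, 7, 8})
```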