🤖 AI Summary
This study systematically investigates the decay of knowledge editing (KE) in large language models (LLMs) after fine-tuning, a question with direct consequences for safety and practical deployment: do edits persist, or do they vanish once the model is fine-tuned? Using two state-of-the-art KE methods (MEMIT and AlphaEdit) and three fine-tuning paradigms (full-parameter fine-tuning, LoRA, and DoRA), the authors evaluate 232 experimental configurations spanning five LLMs and three benchmark datasets. The findings reveal that (1) edit decay is significant and pervasive, with AlphaEdit edits more vulnerable than MEMIT edits; (2) fine-tuning only the non-edited layers induces stronger decay than full-parameter fine-tuning; and (3) a novel selective-layer fine-tuning strategy enables controllable, targeted removal of edits. The work provides the first systematic, quantitative evidence on KE robustness under fine-tuning and delivers practical techniques for evaluating edit persistence, erasing malicious edits, and conducting secure, edit-aware model adaptation.
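To make the notion of edit persistence concrete, here is a minimal sketch of how one might measure the fraction of edited facts a model still recalls before and after fine-tuning. It is not the paper's evaluation protocol: the `edits` list of (prompt, target) pairs, the greedy-decoding setup, and the substring-match criterion are illustrative assumptions.

```python
# Minimal sketch (not the paper's protocol): estimate edit survival by checking
# whether the model's greedy continuation still contains the edited target
# answer; comparing the rate before vs. after fine-tuning gives edit decay.
import torch


def edit_success_rate(model, tokenizer, edits, max_new_tokens=10):
    """edits: list of (prompt, target) pairs produced by the KE method."""
    model.eval()
    hits = 0
    for prompt, target in edits:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                 do_sample=False)
        # Decode only the newly generated tokens, not the prompt.
        continuation = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                        skip_special_tokens=True)
        hits += int(target.strip().lower() in continuation.lower())
    return hits / len(edits)


# Edit decay = success rate on the edited model minus success rate after
# fine-tuning that edited model, e.g.:
# decay = edit_success_rate(edited_model, tok, edits) \
#         - edit_success_rate(finetuned_model, tok, edits)
```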
📝 Abstract
Knowledge editing (KE) has emerged as a lightweight alternative to retraining for correcting or injecting specific facts in large language models (LLMs). Meanwhile, fine-tuning remains the default operation for adapting LLMs to new domains and tasks. Despite their widespread adoption, these two post-training interventions have been studied in isolation, leaving open a crucial question: if we fine-tune an edited model, do the edits survive? This question is motivated by two practical scenarios: removing covert or malicious edits, and preserving beneficial edits. If fine-tuning impairs edits, as shown in Figure 1, current KE methods become less useful, since every fine-tuned model would require re-editing, significantly increasing cost; if edits persist, fine-tuned models risk propagating hidden malicious edits, raising serious safety concerns. To this end, we systematically quantify edit decay after fine-tuning, investigating how fine-tuning affects previously edited knowledge. We evaluate two state-of-the-art editing methods (MEMIT, AlphaEdit) and three fine-tuning approaches (full-parameter, LoRA, DoRA) across five LLMs and three datasets, yielding 232 experimental configurations. Our results show that edits decay after fine-tuning, with survival varying across configurations; for example, AlphaEdit edits decay more than MEMIT edits. Further, we propose selective-layer fine-tuning and find that fine-tuning only the edited layers can effectively remove edits, though at a slight cost to downstream performance. Surprisingly, fine-tuning non-edited layers impairs more edits than full fine-tuning. Overall, our study establishes empirical baselines and actionable strategies for integrating knowledge editing with fine-tuning, and underscores that evaluating model editing requires considering the full LLM application pipeline.
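The selective-layer idea can be illustrated with a short sketch, assuming a PyTorch / Hugging Face-style decoder whose parameter names contain the transformer block index (e.g., `model.layers.4.mlp...`); the name pattern and the edited-layer indices below are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of selective-layer fine-tuning: freeze everything except the
# chosen layers (e.g., those modified by MEMIT/AlphaEdit), so a standard
# fine-tuning loop only updates those layers.
import re


def unfreeze_only_layers(model, layer_indices):
    """Enable gradients only for parameters belonging to `layer_indices`.

    Assumes parameter names embed the block index as '.<idx>.', which holds
    for most Hugging Face decoder implementations.
    """
    pattern = re.compile(r"\.(\d+)\.")
    for name, param in model.named_parameters():
        match = pattern.search(name)
        in_target = match is not None and int(match.group(1)) in layer_indices
        param.requires_grad = in_target


# Example (indices are illustrative): fine-tune only the edited layers to
# erase edits, or pass the complementary index set to fine-tune only the
# non-edited layers.
# unfreeze_only_layers(model, layer_indices={4, 5, 6, 7, 8})
```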