Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the tendency of large language models to forget pretraining knowledge during supervised finetuning (SFT), and the unclear role optimizers play in balancing learning and forgetting. The authors examine whether full finetuning with the same optimizer used in pretraining can reduce forgetting without sacrificing new-task performance. They report a phenomenon they call “optimizer-model consistency”: matching the pretraining optimizer yields less forgetting and the same or better new-task performance than other optimizers and than LoRA. In a separate comparison where each optimizer is used throughout both pretraining and SFT, Muon performs worse than AdamW when finetuned for reasoning tasks, which the authors suggest may stem from Muon's strong tendency towards rote memorization. Through controlled experiments, a synthetic language modeling experiment, and theoretical analysis, the work argues that optimizers regularize activations and thereby shape the loss landscape around the pretrained checkpoint, so SFT weight updates need a specific structure to limit forgetting.

📝 Abstract

Optimizers play an important role in both the pretraining and finetuning stages of training large language models (LLMs). In this paper, we present an observation that full finetuning with the same optimizer as in pretraining achieves a better learning-forgetting tradeoff, i.e., forgetting less while achieving the same or better performance on the new task, than other optimizers and, perhaps surprisingly, LoRA, during the supervised finetuning (SFT) stage. We term this phenomenon optimizer-model consistency. To better understand it, through controlled experiments and theoretical analysis, we show that: 1) optimizers can shape the models by having regularization effects on the activations, leading to different landscapes around the pretrained checkpoints; 2) in response to this regularization effect, the weight updates in SFT should follow specific structures to reduce forgetting of the knowledge learned in pretraining, and these structures can be obtained by using the same optimizer. Moreover, we specifically compare Muon and AdamW when each is employed throughout the pretraining and SFT stages and find that Muon performs worse when finetuned for reasoning tasks. With a synthetic language modeling experiment, we demonstrate that this can come from Muon's strong tendency towards rote memorization, which may hurt pattern acquisition when only a small amount of data is available, as is the case in SFT.
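
The learning-forgetting tradeoff described in the abstract can be made concrete with a small sketch. This is only one plausible way to quantify it, not the paper's metric: it assumes that "learning" is measured as the drop in loss on the new task and "forgetting" as the rise in loss on held-out pretraining-style data. The names `EvalLosses`, `tradeoff`, and `dominates` are hypothetical, and the numbers are made up for illustration rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class EvalLosses:
    """Average losses of one checkpoint (hypothetical container, not from the paper)."""
    pretrain_loss: float  # loss on held-out pretraining-style data
    task_loss: float      # loss on the new SFT task


def tradeoff(before: EvalLosses, after: EvalLosses) -> tuple[float, float]:
    """Return (learning, forgetting) for one finetuning run.

    learning:   drop in new-task loss (larger is better)
    forgetting: rise in pretraining-data loss (smaller is better)
    """
    learning = before.task_loss - after.task_loss
    forgetting = after.pretrain_loss - before.pretrain_loss
    return learning, forgetting


def dominates(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if run a learns at least as much as run b and forgets no more,
    and is strictly better on at least one of the two axes."""
    learn_a, forget_a = a
    learn_b, forget_b = b
    no_worse = learn_a >= learn_b and forget_a <= forget_b
    strictly_better = learn_a > learn_b or forget_a < forget_b
    return no_worse and strictly_better


# Made-up illustrative losses, NOT results from the paper.
pretrained = EvalLosses(pretrain_loss=2.0, task_loss=3.0)
run_same_optimizer = EvalLosses(pretrain_loss=2.25, task_loss=1.5)
run_other_optimizer = EvalLosses(pretrain_loss=2.5, task_loss=1.5)

same = tradeoff(pretrained, run_same_optimizer)
other = tradeoff(pretrained, run_other_optimizer)
print(same, other, dominates(same, other))
```

Under these assumptions, a run has a "better tradeoff" when it dominates another in this sense: equal or greater learning with equal or less forgetting. Accuracy-based metrics could be substituted for losses by flipping the signs.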

Problem

Research questions and friction points this paper is trying to address.

optimizer-model consistency

catastrophic forgetting

large language models

supervised finetuning

pretraining

Innovation

Methods, ideas, or system contributions that make the work stand out.

optimizer-model consistency

full finetuning

forgetting reduction