🤖 AI Summary
This work addresses the performance degradation observed when the Muon optimizer is applied directly to fine-tune Adam-pretrained models, which the authors attribute to an optimizer mismatch. Controlled experiments relate the mismatch to the distinct implicit biases of Adam and Muon and provide evidence that it disrupts pretrained knowledge, with the disruption scaling with update strength. To mitigate this, the authors constrain updates with a parameter-efficient fine-tuning method, Low-Rank Adaptation (LoRA). Across language and vision tasks, LoRA reduces the performance gap between Adam and Muon that appears under full fine-tuning. Further studies on LoRA rank, catastrophic forgetting, and LoRA variants support the link between mismatch severity and update strength.
📝 Abstract
Muon has emerged as an efficient alternative to Adam for pretraining, yet remains underused for fine-tuning. A key obstacle is that most open models are pretrained with Adam, and naively switching to Muon for fine-tuning leads to degraded performance due to an optimizer mismatch. We investigate this mismatch through controlled experiments and relate it to the distinct implicit biases of Adam and Muon. We provide evidence that the mismatch disrupts pretrained knowledge, and that this disruption scales with update strength. This leads us to hypothesize that constraining updates should mitigate the mismatch. We validate this with LoRA: across language and vision tasks, LoRA reduces the performance gap between Adam and Muon observed under full fine-tuning. Studies on LoRA rank, catastrophic forgetting, and LoRA variants further confirm that mismatch severity correlates with update strength. These results shed light on how optimizer mismatch affects fine-tuning and how it can be mitigated. Our code is available at https://github.com/XingyuQu/muon-finetune.
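As a rough illustration of what "constraining updates" can mean here, the NumPy sketch below contrasts a full-matrix, Muon-style update direction with a LoRA weight change. It is not taken from the authors' repository. The function names, matrix shapes, and `alpha` value are assumptions chosen for illustration, and an exact SVD stands in for the Newton-Schulz iteration that Muon uses to approximate orthogonalization.

```python
import numpy as np


def orthogonalize(m):
    """Return U @ V^T from the SVD of m, i.e. m with all singular values set to 1.

    Muon approximates this map with a Newton-Schulz iteration; the exact SVD
    is used here only to keep the sketch short.
    """
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def lora_delta(a, b, alpha):
    """LoRA weight change (alpha / r) * B @ A, whose rank is at most r."""
    r = a.shape[0]
    return (alpha / r) * (b @ a)


rng = np.random.default_rng(0)
d_out, d_in, r = 64, 32, 4  # hypothetical layer shape and LoRA rank

# Full fine-tuning: the orthogonalized direction touches every singular direction.
grad = rng.standard_normal((d_out, d_in))
full_update = orthogonalize(grad)

# LoRA: random factors stand in for trained adapters (B is usually zero at init).
a = rng.standard_normal((r, d_in))
b = rng.standard_normal((d_out, r))
lora_update = lora_delta(a, b, alpha=8.0)

full_rank = np.linalg.matrix_rank(full_update)
lora_rank = np.linalg.matrix_rank(lora_update)
singular_values = np.linalg.svd(full_update, compute_uv=False)

print("rank of full Muon-style update:", full_rank)
print("rank of LoRA weight change:", lora_rank)
```

In this toy setting the orthogonalized update has full rank with every singular value equal to one, while the LoRA change is confined to a subspace of rank at most `r`. Rank is only one way in which LoRA limits an update; the abstract's claim concerns update strength more broadly.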