🤖 AI Summary
This work challenges the conventional assumption that performance gains from earlier fine-tuning stages necessarily propagate to subsequent reuse stages. It reveals that excessive fine-tuning, especially prolonged full-parameter or LoRA-based adaptation, degrades the reusability of expert models in upcycling scenarios: such experts suffer significant accuracy drops when merged into multi-task systems or repurposed as MoE layers, primarily because late fine-tuning overfits a small set of hard examples whose gains are then forgotten during merging.
Method: We propose a task-adaptive aggressive early-stopping strategy, informed by gradient analysis and dynamic hard-example tracking.
Contribution/Results: We are the first to identify the counterintuitive "the better the fine-tuning, the worse the reuse" phenomenon. Our method improves the accuracy of upcycled models by 3.2–7.8% across multiple benchmarks and substantially enhances both the generalization and the stability of LoRA-to-MoE migration.
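The early-stopping criterion above can be illustrated with a minimal sketch. This is not the paper's implementation: the functions, the loss-dominance heuristic, and the `threshold`/`dominance` parameters are all hypothetical stand-ins for the gradient analysis and dynamic hard-example tracking the summary describes.

```python
# Hypothetical sketch: stop fine-tuning once a small set of persistently
# hard examples starts to dominate the remaining loss, a crude proxy for
# the memorization regime described in the summary.

def hard_example_fraction(per_example_losses, threshold=2.0):
    """Fraction of the total loss contributed by examples whose loss
    exceeds `threshold` times the median loss (hypothetical heuristic)."""
    losses = sorted(per_example_losses)
    median = losses[len(losses) // 2]
    total = sum(per_example_losses)
    hard = sum(l for l in per_example_losses if l > threshold * median)
    return hard / total if total > 0 else 0.0

def should_stop(latest_per_example_losses, dominance=0.5):
    """Trigger aggressive early stopping when hard examples account for
    more than `dominance` of the loss at the latest evaluation step."""
    return hard_example_fraction(latest_per_example_losses) > dominance
```

In this toy version, a fine-tuning run whose loss is spread evenly across examples keeps going, while one whose loss is concentrated in a few outliers stops early; the paper's actual criterion is task-adaptive and gradient-informed.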
📝 Abstract
Modern deep learning is increasingly characterized by the use of open-weight foundation models that can be fine-tuned on specialized datasets. This has led to a proliferation of expert models and adapters, often shared via platforms like HuggingFace and AdapterHub. To leverage these resources, numerous model upcycling methods have emerged, enabling the reuse of fine-tuned models in multi-task systems. A natural pipeline has thus formed to harness the benefits of transfer learning and amortize sunk training costs: models are pre-trained on general data, fine-tuned on specific tasks, and then upcycled into more general-purpose systems. A prevailing assumption is that improvements at one stage of this pipeline propagate downstream, leading to gains at subsequent steps. In this work, we challenge that assumption by examining how expert fine-tuning affects model upcycling. We show that long fine-tuning of experts that optimizes for their individual performance leads to degraded merging performance, both for fully fine-tuned and LoRA-adapted models, and to worse downstream results when LoRA adapters are upcycled into MoE layers. We trace this degradation to the memorization of a small set of difficult examples that dominate late fine-tuning steps and are subsequently forgotten during merging. Finally, we demonstrate that a task-dependent aggressive early stopping strategy can significantly improve upcycling performance.
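The merging stage of the pipeline described above can be made concrete with a minimal sketch. Uniform weight averaging of experts is only one of the upcycling methods the abstract alludes to (this is a standard "model soup"-style average, not necessarily the exact procedure studied in the paper), and plain Python lists stand in for parameter tensors:

```python
# Minimal illustration of merging fine-tuned experts by uniform weight
# averaging. Assumes all experts share the same architecture, so their
# state dicts have identical keys and shapes.

def merge_state_dicts(state_dicts):
    """Element-wise average of a list of state dicts (name -> list of
    floats here, standing in for tensors)."""
    n = len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        columns = zip(*(sd[name] for sd in state_dicts))
        merged[name] = [sum(col) / n for col in columns]
    return merged

# Usage: average two experts fine-tuned from the same base model.
expert_a = {"layer.weight": [1.0, 2.0]}
expert_b = {"layer.weight": [3.0, 4.0]}
soup = merge_state_dicts([expert_a, expert_b])  # {"layer.weight": [2.0, 3.0]}
```

The paper's finding is that how long each expert was fine-tuned changes how well this averaged model performs, even though the merging operation itself is unchanged.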