🤖 AI Summary
To address catastrophic forgetting when fine-tuning large language models (LLMs), particularly when pretraining data is inaccessible, this paper proposes MoFO, a lightweight, data-free optimizer. MoFO mitigates forgetting through sparse parameter updates: a momentum-informed greedy block coordinate descent (BCD) strategy updates, at each iteration, only the parameters with the largest momentum magnitudes in each block, keeping all other parameters fixed and thus closer to their pretrained values. Theoretical analysis establishes convergence guarantees. Empirical evaluation across multiple benchmarks demonstrates that MoFO achieves task performance on par with full-parameter fine-tuning while substantially alleviating forgetting: it improves retention of general capabilities and reduces the average forgetting rate by 38%.
📝 Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. Typically, LLMs are first pre-trained on large corpora and subsequently fine-tuned on task-specific datasets. However, during fine-tuning, LLMs may forget some of the knowledge acquired in the pre-training stage, leading to a decline in general capabilities. Existing approaches to mitigating forgetting often rely on access to pre-training data, which may be unavailable in many real-world scenarios, such as fine-tuning open-source LLMs for which only checkpoints are released. To address this challenge, we propose a new fine-tuning algorithm termed Momentum-Filtered Optimizer (MoFO). MoFO is an extension of greedy block coordinate descent (BCD) methods: in each iteration, MoFO updates only the model parameters with the largest momentum magnitudes, while keeping all other parameters fixed. MoFO achieves fine-tuning performance similar to the default fine-tuning algorithm while effectively mitigating knowledge forgetting. We validate MoFO through rigorous convergence analysis and extensive experiments, demonstrating its effectiveness in mitigating forgetting without pre-training data.
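The core update rule described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the Adam-style moment estimates, the per-block update fraction `alpha`, and the function name `mofo_step` are assumptions used only to show the momentum-filtering idea (update the top-`alpha` fraction of entries in each block by first-moment magnitude, freeze the rest).

```python
import numpy as np

def mofo_step(params, grads, state, t, alpha=0.1, lr=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One momentum-filtered update over a list of parameter blocks.

    Only the top-`alpha` fraction of entries in each block, ranked by
    first-moment (momentum) magnitude, is updated; all other entries
    stay fixed, keeping them at their pretrained values.
    `state` holds Adam-style moment buffers "m" and "v" (an assumed
    choice; the filtering idea is independent of the base optimizer).
    """
    for i, (p, g) in enumerate(zip(params, grads)):
        m, v = state["m"][i], state["v"][i]
        m[...] = beta1 * m + (1 - beta1) * g        # first moment
        v[...] = beta2 * v + (1 - beta2) * g * g    # second moment
        m_hat = m / (1 - beta1 ** t)                # bias correction
        v_hat = v / (1 - beta2 ** t)
        # Momentum filter: mask selecting the k largest |m| in this block.
        k = max(1, int(alpha * m.size))
        thresh = np.partition(np.abs(m).ravel(), -k)[-k]
        mask = np.abs(m) >= thresh
        # Masked Adam-style step: unselected entries receive no update.
        p -= lr * mask * m_hat / (np.sqrt(v_hat) + eps)
```

For example, with `alpha=0.3` on a 10-entry block, only the three entries with the largest momentum magnitudes move in a given step; the remaining seven are untouched, which is what keeps the fine-tuned model close to the pretrained checkpoint without needing any pre-training data.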