MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning

📅 2024-07-30
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address catastrophic forgetting in fine-tuning large language models (LLMs) — particularly when pretraining data is inaccessible — this paper proposes MoFO, a lightweight, data-free optimizer. MoFO mitigates forgetting by enforcing sparse parameter updates guided by momentum magnitude, coupled with a momentum-informed greedy block coordinate descent (BCD) strategy that dynamically preserves the subset of parameters most critical to pretrained knowledge during gradient updates. Theoretical analysis establishes convergence guarantees. Empirical evaluation across multiple benchmarks demonstrates that MoFO achieves task performance on par with full-parameter fine-tuning while substantially alleviating forgetting: it improves retention of general capabilities and reduces average forgetting rate by 38%.

📝 Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. Typically, LLMs are first pre-trained on large corpora and subsequently fine-tuned on task-specific datasets. However, during fine-tuning, LLMs may forget some knowledge acquired in the pre-training stage, leading to a decline in general capabilities. Existing approaches to mitigate forgetting often rely on access to pre-training data, which may be unavailable in many real-world scenarios, such as fine-tuning checkpoint-only open-source LLMs. To address this challenge, we propose a new fine-tuning algorithm termed Momentum-Filtered Optimizer (MoFO). MoFO is an extension of greedy block coordinate descent (BCD) methods: in each iteration, MoFO only updates the model parameters with the largest momentum magnitudes, while keeping all other parameters fixed. MoFO achieves similar fine-tuning performance to the default fine-tuning algorithm while effectively mitigating knowledge forgetting. We validate MoFO through rigorous convergence analysis and extensive experiments, demonstrating its effectiveness in mitigating forgetting without pre-training data.
Problem

Research questions and friction points this paper is trying to address.

Mitigating knowledge forgetting in LLM fine-tuning
Avoiding reliance on pre-training data access
Improving fine-tuning without general capability decline
Innovation

Methods, ideas, or system contributions that make the work stand out.

Momentum-Filtered Optimizer (MoFO) for fine-tuning
Updates only high-momentum parameters per iteration
Mitigates forgetting without pre-training data
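The core mechanism described above, tracking momentum for all parameters but applying the update only to the entries with the largest momentum magnitudes, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation; `update_fraction` is a hypothetical knob standing in for the paper's per-block update ratio.

```python
import numpy as np

def mofo_step(param, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, update_fraction=0.1):
    """One momentum-filtered update on a single parameter block (sketch).

    Moment estimates are maintained for every entry, Adam-style, but only
    the `update_fraction` of entries with the largest momentum magnitudes
    are moved; the rest stay at their pre-trained values.
    """
    # Standard Adam-style moment estimates, tracked for all entries.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2

    # Filter: mask in only the top-k entries by momentum magnitude.
    k = max(1, int(update_fraction * param.size))
    threshold = np.sort(np.abs(m).ravel())[-k]
    mask = np.abs(m) >= threshold

    # Masked update; entries outside the mask are left untouched.
    param = param - lr * mask * m / (np.sqrt(v) + eps)
    return param, m, v
```

In a full training loop this filter would be applied per parameter block (e.g. per weight matrix), which is what connects it to greedy block coordinate descent: each step greedily picks the coordinates whose momentum suggests they matter most for the fine-tuning task.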
Yupeng Chen
The Chinese University of Hong Kong, Shenzhen, China
Senmiao Wang
The Chinese University of Hong Kong, Shenzhen, China
Zhihang Lin
Xiamen University & Shanghai Innovation Institute
Efficient Artificial Intelligence
Zeyu Qin
Hong Kong University of Science and Technology
Machine Learning, Deep Learning, Scalable Oversight, AI Safety
Yushun Zhang
The Chinese University of Hong Kong, Shenzhen, China
Optimization, Deep Learning
Tian Ding
Shenzhen Research Institute of Big Data
Ruoyu Sun
The Chinese University of Hong Kong, Shenzhen, China; Shenzhen Research Institute of Big Data