Soup to go: mitigating forgetting during continual learning with model averaging

📅 2025-01-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To mitigate catastrophic forgetting in cross-domain continual learning, this paper proposes Sequential Fine-tuning with Averaging (SFA), a parameter-efficient method that periodically fuses the model being trained with earlier checkpoints during training, without storing historical data, auxiliary per-parameter state, or replay buffers. SFA embeds model averaging directly into the training process, eliminating the need for gradient-level regularization or explicit rehearsal. Inspired by L2 regression, it employs a principled weight-averaging strategy over sequentially saved checkpoints. On diverse image and language continual learning benchmarks, SFA consistently outperforms baselines including Task Arithmetic, TIES Merging, WiSE-FT, EWC, and L2 regularization. Moreover, it achieves a superior balance of forward and backward transfer, demonstrating both computational efficiency and strong generalization across domains.

📝 Abstract
In continual learning, where task data arrives in a sequence, fine-tuning on later tasks often leads to performance degradation on earlier tasks. This is especially pronounced when the tasks come from diverse domains. In this setting, how can we mitigate catastrophic forgetting of earlier tasks and retain what the model has learned with minimal computational expense? Inspired by other merging methods and L2 regression, we propose Sequential Fine-tuning with Averaging (SFA), a method that merges the model currently being trained with earlier checkpoints during the course of training. SOTA approaches typically maintain a data buffer of past tasks or impose a penalty at each gradient step. In contrast, our method achieves comparable results without the need to store past data or keep multiple copies of parameters for each gradient step. Furthermore, our method outperforms common merging techniques such as Task Arithmetic, TIES Merging, and WiSE-FT, as well as penalty methods like L2 regularization and Elastic Weight Consolidation. In doing so, our method offers insight into the benefits of merging partially trained models during training across both image and language domains.
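The abstract describes merging the in-training model with earlier checkpoints rather than storing data or per-step penalties. A minimal, framework-agnostic sketch of that idea is below; the parameter layout, the merge interval, and the interpolation weight `beta` are illustrative assumptions, not the paper's exact recipe:

```python
def average_with_checkpoint(current, checkpoint, beta=0.5):
    """Interpolate current weights toward an earlier checkpoint.

    current, checkpoint: dicts mapping parameter names to lists of floats.
    beta is the weight on the current model; beta=0.5 is a plain average.
    """
    return {
        name: [beta * w + (1.0 - beta) * c
               for w, c in zip(weights, checkpoint[name])]
        for name, weights in current.items()
    }

# Toy continual-learning loop: before a new task begins, snapshot the
# model; while fine-tuning on the new task, periodically merge back
# toward that snapshot instead of replaying old data.
model = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
snapshot = {k: list(v) for k, v in model.items()}
for step in range(10):
    # stand-in for one gradient step on the new task's data
    model = {k: [w + 0.1 for w in v] for k, v in model.items()}
    if (step + 1) % 5 == 0:  # merge every few steps during training
        model = average_with_checkpoint(model, snapshot, beta=0.5)
```

Because averaging happens during training, the drift away from the snapshot is repeatedly pulled back, which is the mechanism the paper credits for reducing forgetting without a replay buffer.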
Problem

Research questions and friction points this paper is trying to address.

Continual Learning
Catastrophic Forgetting
Task Adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequential Fine-tuning with Averaging (SFA)
Reducing Catastrophic Forgetting
Efficient Model Consolidation
Anat Kleiman
Harvard University

G. Dziugaite
Google DeepMind

Jonathan Frankle
Databricks

S. Kakade
Harvard University, Kempner Institute

Mansheej Paul
Research Scientist, Databricks