🤖 AI Summary
To address the stability-plasticity dilemma in vision-language models (VLMs) arising from sequential fine-tuning in continual learning, this paper proposes a weight-alignment-based model fusion approach—the first to introduce model fusion into VLM continual learning. Our method explicitly aligns the semantic weight spaces across tasks, mitigating parameter interference during fusion and thereby simultaneously preserving knowledge stability for previously learned tasks and enabling plasticity for new tasks—without requiring experience replay or regularization. Experiments demonstrate substantial mitigation of catastrophic forgetting, enhanced robustness on multi-task sequences and high-similarity task settings, and improved cross-task generalization. The core innovation lies in replacing sequential parameter updates with an alignment-driven fusion mechanism, establishing a novel paradigm for continual learning in VLMs.
📝 Abstract
Continual learning is conventionally tackled through sequential fine-tuning, a process that, while enabling adaptation, inherently favors plasticity over the stability needed to retain prior knowledge. While existing approaches attempt to mitigate catastrophic forgetting, a bias towards recent tasks persists as they build upon this sequential nature. In this work we present a new perspective based on model merging to maintain stability while still retaining plasticity. Rather than just sequentially updating the model weights, we propose merging newly trained task parameters with previously learned ones, promoting a better balance. To maximize the effectiveness of the merging process, we propose a simple mechanism that promotes learning aligned weights with previous ones, thereby avoiding interference when merging. We evaluate this approach on large Vision-Language Models (VLMs), and demonstrate its effectiveness in reducing forgetting, increasing robustness to various task orders and similarities, and improving generalization.