Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This survey addresses continual learning for vision-language models (VLMs) under non-stationary data streams, targeting three interrelated challenges: catastrophic forgetting, cross-modal feature drift, and degradation of zero-shot generalization. To this end, it proposes the first problem-driven taxonomy for VLM continual learning, built around three fundamental failure modes. The taxonomy organizes existing solutions into a unified framework integrating multimodal replay, cross-modal regularization, and parameter-efficient adaptation (notably low-rank updates), augmented by implicit memory mechanisms and modular architectures that preserve modality alignment and generalization. Contributions include: (i) the first systematic survey and empirical evaluation of VLM continual learning; (ii) an open-source benchmark repository with standardized protocols; (iii) a critical analysis revealing key limitations in current evaluation practices; and (iv) identification of two promising future directions: continual pretraining and compositional zero-shot learning.

📝 Abstract
Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training. However, enabling them to learn continually from non-stationary data remains a major challenge, as their cross-modal alignment and generalization capabilities are particularly vulnerable to catastrophic forgetting. Unlike traditional unimodal continual learning (CL), VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. This survey offers the first focused and systematic review of continual learning for VLMs (VLM-CL). We begin by identifying the three core failure modes that degrade performance in VLM-CL. Based on these, we propose a challenge-driven taxonomy that maps solutions to their target problems: (1) *Multi-Modal Replay Strategies* address cross-modal drift through explicit or implicit memory mechanisms; (2) *Cross-Modal Regularization* preserves modality alignment during updates; and (3) *Parameter-Efficient Adaptation* mitigates parameter interference with modular or low-rank updates. We further analyze current evaluation protocols, datasets, and metrics, highlighting the need for better benchmarks that capture VLM-specific forgetting and compositional generalization. Finally, we outline open problems and future directions, including continual pre-training and compositional zero-shot learning. This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems. All resources are available at: https://github.com/YuyangSunshine/Awesome-Continual-learning-of-Vision-Language-Models.
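The explicit-memory variant of multimodal replay described in the abstract can be sketched as a reservoir-sampled buffer over paired image/text features. All class and variable names below are illustrative, not taken from any surveyed method:

```python
import random

class MultiModalReplayBuffer:
    """Minimal sketch of explicit multimodal replay: old (image, text)
    pairs are kept and replayed alongside new-task data so cross-modal
    alignment does not drift toward the current task alone."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.pairs = []      # stored (image_feat, text_feat) tuples
        self.num_seen = 0    # total pairs observed in the stream

    def add(self, image_feat, text_feat):
        self.num_seen += 1
        if len(self.pairs) < self.capacity:
            self.pairs.append((image_feat, text_feat))
        else:
            # Reservoir sampling: every pair seen so far has an equal
            # probability of residing in the buffer.
            idx = random.randrange(self.num_seen)
            if idx < self.capacity:
                self.pairs[idx] = (image_feat, text_feat)

    def sample(self, batch_size: int):
        k = min(batch_size, len(self.pairs))
        return random.sample(self.pairs, k)
```

In practice the stored items would be embeddings or raw samples; reservoir sampling keeps the buffer an unbiased subset of the whole stream without knowing its length in advance.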
Problem

Research questions and friction points this paper is trying to address.

Address catastrophic forgetting in vision-language models during continual learning
Mitigate cross-modal feature drift and parameter interference in VLMs
Preserve zero-shot capabilities and modality alignment in lifelong learning
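One common way to address the modality-alignment problem above is distillation against a frozen copy of the pre-trained model. The sketch below is a generic illustration of that idea, not the paper's specific formulation; all names are hypothetical, and inputs are assumed to be L2-normalized feature matrices:

```python
import numpy as np

def alignment_distillation_loss(img_new, txt_new, img_old, txt_old):
    """Cross-modal regularization sketch: penalize drift of the
    image-text similarity structure between the current model
    (img_new, txt_new) and a frozen pre-trained model
    (img_old, txt_old). Each input is an (n, d) array of
    L2-normalized features."""
    sim_new = img_new @ txt_new.T   # current image-text similarities
    sim_old = img_old @ txt_old.T   # similarities under the frozen model
    return float(np.mean((sim_new - sim_old) ** 2))
```

Adding this term to the new-task loss discourages updates that reshape the shared embedding space, which is one way zero-shot transfer is preserved during sequential fine-tuning.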
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Modal Replay Strategies counter cross-modal feature drift via explicit or implicit memory
Cross-Modal Regularization preserves modality alignment during updates
Parameter-Efficient Adaptation mitigates parameter interference with modular or low-rank updates
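The low-rank updates mentioned above follow the LoRA pattern: the pretrained weight stays frozen and each task trains only a small rank-r correction. A minimal NumPy sketch, with illustrative dimensions and names:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Low-rank adapted forward pass: y = x @ (W + alpha * A @ B).

    The frozen pretrained weight W (d_in x d_out) is augmented with a
    trainable low-rank update A @ B of rank r << min(d_in, d_out), so
    each task touches only r * (d_in + d_out) parameters, limiting
    interference with the shared backbone."""
    return x @ W + alpha * (x @ A) @ B

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2
W = rng.normal(size=(d_in, d_out))      # frozen pretrained weight
A = rng.normal(size=(d_in, r)) * 0.01   # trainable down-projection
B = np.zeros((r, d_out))                # trainable up-projection, zero init
x = rng.normal(size=(1, d_in))

# With B initialized to zero, adaptation starts as an exact no-op:
assert np.allclose(lora_forward(x, W, A, B), x @ W)
```

Zero-initializing B is the standard LoRA choice: training begins from the pretrained behavior, and only the adapter parameters move away from it.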
Yuyang Liu
School of Electronics Engineering and Computer Science, Peking University, Beijing, China
Qiuhe Hong
School of Electronics Engineering and Computer Science, Peking University, Beijing, China
Linlan Huang
Nankai University
Continual Learning
Alexandra Gomez-Villa
Assistant Professor, Universitat Autònoma de Barcelona & Researcher, Computer Vision Center
Computer Vision, Machine Learning, Visual Perception
Dipam Goswami
Computer Vision Center, Universitat Autònoma de Barcelona
Continual Learning, Transfer Learning, Federated Learning, Vision Language Models
Xialei Liu
VCIP, TMCC, College of Computer Science, Nankai University, Tianjin, China
Joost van de Weijer
Computer Vision Center, Universitat Autònoma de Barcelona
Computer Vision, Deep Learning, Continual Learning
Yonghong Tian
School of Electronics Engineering and Computer Science, Peking University, Beijing, China