🤖 AI Summary
This survey addresses continual learning for vision-language models (VLMs) under non-stationary data streams, targeting three interrelated challenges: catastrophic forgetting, cross-modal feature drift, and erosion of zero-shot generalization. It proposes the first challenge-driven taxonomy for VLM continual learning, mapping solutions to the failure modes they target: multi-modal replay strategies (explicit or implicit memory mechanisms) that counter cross-modal drift, cross-modal regularization that preserves modality alignment during updates, and parameter-efficient adaptation (modular or low-rank updates) that mitigates parameter interference in shared architectures. Contributions include: (i) the first focused and systematic review of VLM continual learning; (ii) an open-source repository of resources with standardized protocols; (iii) a critical analysis revealing key limitations in current evaluation practices; and (iv) identification of two promising future directions, continual pre-training and compositional zero-shot learning.
📝 Abstract
Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training. However, enabling them to learn continually from non-stationary data remains a major challenge, as their cross-modal alignment and generalization capabilities are particularly vulnerable to catastrophic forgetting. Unlike traditional unimodal continual learning (CL), VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. This survey offers the first focused and systematic review of continual learning for VLMs (VLM-CL). We begin by identifying the three core failure modes that degrade performance in VLM-CL. Based on these, we propose a challenge-driven taxonomy that maps solutions to their target problems: (1) *Multi-Modal Replay Strategies* address cross-modal drift through explicit or implicit memory mechanisms; (2) *Cross-Modal Regularization* preserves modality alignment during updates; and (3) *Parameter-Efficient Adaptation* mitigates parameter interference with modular or low-rank updates. We further analyze current evaluation protocols, datasets, and metrics, highlighting the need for better benchmarks that capture VLM-specific forgetting and compositional generalization. Finally, we outline open problems and future directions, including continual pre-training and compositional zero-shot learning. This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems. All resources are available at: https://github.com/YuyangSunshine/Awesome-Continual-learning-of-Vision-Language-Models.
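To make two of the taxonomy's mechanisms concrete, the sketch below (a minimal NumPy illustration, not code from the survey; all names, dimensions, and hyperparameters are assumptions) shows (a) a LoRA-style low-rank update added beside a frozen projection weight, so only a small number of parameters are trained per task, and (b) a hypothetical cross-modal regularizer that penalizes drift in image-text cosine alignment relative to the frozen model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pre-trained projection weight (stand-in for a VLM head); not updated.
d_in, d_out, r = 16, 8, 2            # r << min(d_in, d_out): low-rank bottleneck
W = rng.standard_normal((d_out, d_in))

# Low-rank residual: only A and B would be trained on the new task.
A = rng.standard_normal((r, d_in)) * 0.01   # down-projection
B = np.zeros((d_out, r))                    # up-projection; zero-init => adapter starts as a no-op
alpha = 4.0                                 # scaling hyperparameter (assumed value)

def adapted_forward(x):
    """Frozen path plus scaled low-rank residual: W x + (alpha/r) * B A x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B zero-initialized, outputs match the frozen model exactly,
# so zero-shot behaviour is untouched before any task-specific training.
assert np.allclose(adapted_forward(x), W @ x)

# Hypothetical cross-modal regularizer: keep the image-text cosine
# similarity of the updated model close to that of the frozen model.
def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

img_old, txt_old = rng.standard_normal(d_out), rng.standard_normal(d_out)
img_new, txt_new = img_old + 0.01, txt_old - 0.01   # slightly drifted features
reg_loss = (cosine(img_new, txt_new) - cosine(img_old, txt_old)) ** 2
print(f"alignment-drift penalty: {reg_loss:.6f}")
```

Note the parameter budget: the adapter holds `r * (d_in + d_out)` trainable values versus `d_in * d_out` in the frozen weight, which is the interference-mitigation argument behind the survey's parameter-efficient adaptation category.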