🤖 AI Summary
This work addresses three key challenges in continual learning for vision-language models (VLMs): low parameter efficiency, high memory overhead, and optimization instability. To this end, it introduces zeroth-order (ZO) optimization—systematically replacing conventional first-order (FO) methods for the first time in VLM continual learning. The proposed approach features modality-selective ZO (activated exclusively in either the visual or linguistic branch) and a layer-wise alternating ZO/FO optimization paradigm. Crucially, the authors identify and model inter-modal disparities in ZO perturbation variance, leading to a gradient-sign normalization mechanism with modality-specific constraints. Evaluated on four standard continual learning benchmarks, the method achieves state-of-the-art performance while reducing memory consumption by 89.1%. It further enhances optimization robustness and long-term generalization. This work establishes a novel, efficient, and stable paradigm for continual learning in VLMs.
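The core building block the summary refers to, zeroth-order (ZO) optimization, estimates gradients from loss evaluations alone, so no activations or gradient buffers are stored for backpropagation; this is where the memory savings come from. Below is a minimal two-point (SPSA-style) sketch on a toy quadratic; the function name, hyperparameters, and toy problem are illustrative, not taken from the paper.

```python
import numpy as np

def zo_gradient(loss_fn, theta, mu=1e-3, rng=None):
    """Two-point zeroth-order (SPSA-style) gradient estimate.

    Perturbs the parameters along a random Gaussian direction u and
    uses a finite difference of the loss; only forward evaluations
    are needed, so no backpropagation state is kept in memory.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(theta.shape)  # random perturbation direction
    # Directional derivative estimate along u
    g = (loss_fn(theta + mu * u) - loss_fn(theta - mu * u)) / (2 * mu)
    return g * u  # projected gradient estimate

# Toy usage: ZO-SGD on a simple quadratic with minimum at w = 3.
loss = lambda w: float(np.sum((w - 3.0) ** 2))
rng = np.random.default_rng(0)
w = np.zeros(4)
for _ in range(2000):
    w -= 0.05 * zo_gradient(loss, w, rng=rng)
```

In expectation the estimate equals the true gradient, but its variance grows with dimension, which is one reason the paper keeps first-order updates in part of the network rather than going full-ZO.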
📝 Abstract
Continual learning in vision-language models (VLMs) faces critical challenges in balancing parameter efficiency, memory consumption, and optimization stability. While First-Order (FO) optimizers (e.g., SGD) dominate current approaches, their deterministic gradients often trap models in suboptimal local minima and incur substantial memory overhead. This paper pioneers a systematic exploration of Zeroth-Order (ZO) optimization for vision-language continual learning (VLCL). We first show that naive full-ZO adoption is incompatible with VLCL due to modality-specific instability. To resolve this, we selectively apply ZO to either the vision or the language modality while retaining FO in the complementary branch. Furthermore, we develop a layer-wise optimization paradigm that interleaves ZO and FO across network layers, capitalizing on the heterogeneous learning dynamics of shallow versus deep representations. A key theoretical insight reveals that ZO perturbations in the vision branch exhibit higher variance than their language counterparts, prompting a gradient sign normalization mechanism with modality-specific perturbation constraints. Extensive experiments on four benchmarks demonstrate that our method achieves state-of-the-art performance while reducing memory consumption by 89.1% compared to baselines. Code will be available upon publication.
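The sign-normalization idea can be illustrated with a small sketch: taking the sign of the ZO estimate bounds every coordinate's step to the learning rate, while the perturbation radius `mu` is chosen per modality (smaller for the higher-variance vision branch). The helper name, learning rate, and radii below are hypothetical stand-ins, not the paper's actual settings.

```python
import numpy as np

def zo_sign_step(loss_fn, theta, lr, mu, rng):
    """One sign-normalized ZO update.

    The raw estimate g * u can have very different scales across
    modalities; sign normalization caps each coordinate's step at lr,
    and mu acts as a modality-specific perturbation constraint.
    """
    u = rng.standard_normal(theta.shape)
    g = (loss_fn(theta + mu * u) - loss_fn(theta - mu * u)) / (2 * mu)
    return theta - lr * np.sign(g * u)

rng = np.random.default_rng(1)
loss = lambda w: float(np.sum((w - 1.0) ** 2))

# Hypothetical per-modality radii: tighter for the noisier vision branch.
w_vision = zo_sign_step(loss, np.zeros(3), lr=0.01, mu=1e-4, rng=rng)
w_language = zo_sign_step(loss, np.zeros(3), lr=0.01, mu=1e-2, rng=rng)
```

Regardless of how noisy the raw estimate is, every parameter moves by exactly `lr` per step, which is what stabilizes the update against the inter-modal variance disparity the abstract describes.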