🤖 AI Summary
This work addresses three key challenges in continual learning for vision-language models (VLMs): low parameter efficiency, high memory overhead, and optimization instability. To this end, it introduces zeroth-order (ZO) optimization—systematically replacing conventional first-order (FO) methods for the first time in VLM continual learning. The proposed approach features modality-selective ZO (activated exclusively in either the visual or linguistic branch) and a layer-wise alternating ZO/FO optimization paradigm. Crucially, the authors identify and model inter-modal disparities in ZO perturbation variance, leading to a gradient-sign normalization mechanism with modality-specific constraints. Evaluated on four standard continual learning benchmarks, the method achieves state-of-the-art performance while reducing memory consumption by 89.1%. It further enhances optimization robustness and long-term generalization. This work establishes a novel, efficient, and stable paradigm for continual learning in VLMs.
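The core building block the summary refers to, zeroth-order (ZO) optimization, estimates gradients from loss evaluations alone, so no activations or gradient buffers are stored for backpropagation; this is where the memory savings come from. Below is a minimal two-point (SPSA-style) sketch on a toy quadratic; the function name, hyperparameters, and toy problem are illustrative, not taken from the paper.

```python
import numpy as np

def zo_gradient(loss_fn, theta, mu=1e-3, rng=None):
    """Two-point zeroth-order (SPSA-style) gradient estimate.

    Perturbs the parameters along a random Gaussian direction u and
    uses a finite difference of the loss; only forward evaluations
    are needed, so no backpropagation state is kept in memory.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(theta.shape)  # random perturbation direction
    # Directional derivative estimate along u
    g = (loss_fn(theta + mu * u) - loss_fn(theta - mu * u)) / (2 * mu)
    return g * u  # projected gradient estimate

# Toy usage: ZO-SGD on a simple quadratic with minimum at w = 3.
loss = lambda w: float(np.sum((w - 3.0) ** 2))
rng = np.random.default_rng(0)
w = np.zeros(4)
for _ in range(2000):
    w -= 0.05 * zo_gradient(loss, w, rng=rng)
```

In expectation the estimate equals the true gradient, but its variance grows with dimension, which is one reason the paper keeps first-order updates in part of the network rather than going full-ZO.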
📝 Abstract
Continual learning in vision-language models (VLMs) faces critical challenges in balancing parameter efficiency, memory consumption, and optimization stability. While First-Order (FO) optimizers (e.g., SGD) dominate current approaches, their deterministic gradients often trap models in suboptimal local minima and incur substantial memory overhead. This paper pioneers a systematic exploration of Zeroth-Order (ZO) optimization for vision-language continual learning (VLCL). We first show that naive full-ZO adoption is incompatible with VLCL due to modality-specific instability. To resolve this, we selectively apply ZO to either the vision or the language modality while retaining FO in the complementary branch. Furthermore, we develop a layer-wise optimization paradigm that interleaves ZO and FO across network layers, capitalizing on the heterogeneous learning dynamics of shallow versus deep representations. A key theoretical insight reveals that ZO perturbations in the vision branch exhibit higher variance than their language counterparts, prompting a gradient sign normalization mechanism with modality-specific perturbation constraints. Extensive experiments on four benchmarks demonstrate that our method achieves state-of-the-art performance while reducing memory consumption by 89.1% compared to baselines. Code will be available upon publication.
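The sign-normalization idea can be illustrated with a small sketch: taking the sign of the ZO estimate bounds every coordinate's step to the learning rate, while the perturbation radius `mu` is chosen per modality (smaller for the higher-variance vision branch). The helper name, learning rate, and radii below are hypothetical stand-ins, not the paper's actual settings.

```python
import numpy as np

def zo_sign_step(loss_fn, theta, lr, mu, rng):
    """One sign-normalized ZO update.

    The raw estimate g * u can have very different scales across
    modalities; sign normalization caps each coordinate's step at lr,
    and mu acts as a modality-specific perturbation constraint.
    """
    u = rng.standard_normal(theta.shape)
    g = (loss_fn(theta + mu * u) - loss_fn(theta - mu * u)) / (2 * mu)
    return theta - lr * np.sign(g * u)

rng = np.random.default_rng(1)
loss = lambda w: float(np.sum((w - 1.0) ** 2))

# Hypothetical per-modality radii: tighter for the noisier vision branch.
w_vision = zo_sign_step(loss, np.zeros(3), lr=0.01, mu=1e-4, rng=rng)
w_language = zo_sign_step(loss, np.zeros(3), lr=0.01, mu=1e-2, rng=rng)
```

Regardless of how noisy the raw estimate is, every parameter moves by exactly `lr` per step, which is what stabilizes the update against the inter-modal variance disparity the abstract describes.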