Beyond CLIP Generalization: Against Forward&Backward Forgetting Adapter for Continual Learning of Vision-Language Models

πŸ“… 2025-05-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the degradation of zero-shot capability and insufficient generalization in multi-domain task incremental learning (MTIL) of vision-language models (VLMs), this paper proposes the AFA framework, which pairs an against forward-forgetting adapter with an against backward-forgetting adapter. The forward adapter preserves cross-task zero-shot recognition via task-invariant semantic disentanglement and dual-path CLIP feature alignment, while the backward adapter enhances few-shot adaptability and mitigates catastrophic forgetting. Built upon the CLIP architecture, AFA integrates incremental vision-language alignment fine-tuning with feature disentanglement. Evaluated on a multi-domain few-shot incremental benchmark, AFA significantly outperforms state-of-the-art methods. Notably, its zero-shot transfer performance surpasses that of the original CLIP model, demonstrating a substantive improvement in generalization.
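As a rough illustration of the adapter-on-frozen-CLIP idea described above, the sketch below applies a bottleneck adapter to frozen CLIP image features with a residual connection, so the original zero-shot features are retained alongside the task-adapted ones. This is a hypothetical minimal example: the dimensions, the `alpha` blend weight, and the fusion rule are illustrative assumptions, not the paper's actual AFA design, and random features stand in for real CLIP embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class BottleneckAdapter:
    """Down-project, nonlinearity, up-project, then blend with the frozen
    feature via a residual connection (a common adapter pattern; the paper's
    exact architecture may differ)."""

    def __init__(self, dim=512, bottleneck=64, alpha=0.2):
        self.W_down = rng.normal(0.0, 0.02, (dim, bottleneck))
        self.W_up = rng.normal(0.0, 0.02, (bottleneck, dim))
        self.alpha = alpha  # weight on the adapted path vs. the frozen path

    def __call__(self, feat):
        adapted = relu(feat @ self.W_down) @ self.W_up
        # Residual fusion: the frozen CLIP feature remains the backbone,
        # so setting alpha=0 recovers plain zero-shot CLIP behavior.
        return (1.0 - self.alpha) * feat + self.alpha * adapted

# Stand-in for a batch of 4 frozen CLIP image embeddings of dimension 512
clip_feat = rng.normal(size=(4, 512))
adapter = BottleneckAdapter()
out = adapter(clip_feat)
print(out.shape)  # (4, 512)
```

With `alpha=0` the adapter is an identity map, which mirrors the motivation in the abstract: the incremental components should be able to fall back to CLIP's inherent zero-shot behavior rather than overwrite it.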

πŸ“ Abstract
This study aims to address the problem of multi-domain task incremental learning (MTIL), which requires that vision-language models (VLMs) continuously acquire new knowledge while maintaining their inherent zero-shot recognition capability. Existing paradigms delegate the testing of unseen-domain samples to the original CLIP, which only prevents the degradation of the model's zero-shot capability but fails to further enhance the generalization of the VLM. To this end, we propose a novel MTIL framework, named AFA, which comprises two core modules: (1) an against forward-forgetting adapter that learns task-invariant information for each dataset in the incremental tasks to enhance the zero-shot recognition ability of VLMs; (2) an against backward-forgetting adapter that strengthens the few-shot learning capability of VLMs while supporting incremental learning. Extensive experiments demonstrate that the AFA method significantly outperforms existing state-of-the-art approaches, especially in few-shot MTIL tasks, and surpasses the inherent zero-shot performance of CLIP in terms of transferability. The code is provided in the Supplementary Material.
Problem

Research questions and friction points this paper is trying to address.

Enhance zero-shot recognition in continual learning for vision-language models
Prevent forward and backward forgetting in multi-domain incremental tasks
Improve few-shot learning capability while maintaining zero-shot performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Against forward-forgetting adapter enhances zero-shot recognition
Against backward-forgetting adapter boosts few-shot learning
AFA framework outperforms CLIP in transferability
πŸ”Ž Similar Papers
No similar papers found.