🤖 AI Summary
This work addresses three challenges in incremental learning: low training efficiency, reliance on memory buffers that store old data, and dependence on powerful backbone models. To this end, the authors propose SimE, a framework built on the CLIP vision-language model that inserts task-specific multi-adapter modules and configures them according to an observed nonlinear relationship between the number of adaptive adapter connections and incremental learning performance. This observation enables a lightweight yet effective multi-adapter design that preserves CLIP's zero-shot transfer capability while reducing dependence on both memory buffers and large backbones. Experimental results show that SimE outperforms conventional methods by 9.6% on TinyImageNet and surpasses existing CLIP-based approaches by 5.3% on CIFAR-100.
📝 Abstract
Incremental Learning (IL) aims to learn new tasks while preserving previously acquired knowledge. Integrating the zero-shot learning capabilities of pre-trained vision-language models into IL methods has marked a significant advancement. However, these methods face three primary challenges: (1) the need for improved training efficiency; (2) reliance on a memory bank to store previous data; and (3) the necessity of a strong backbone to augment the model's capabilities. In this paper, we propose SimE, a Simple and Efficient framework that employs a vision-language model with adapters designed specifically for the IL task. We report a remarkable phenomenon: there is a nonlinear correlation between the number of adaptive adapter connections and the model's IL capabilities. While increasing adapter connections between transformer blocks improves model performance, adding more adaptive connections within transformer blocks during smaller incremental steps does not enhance, and may even degrade the model's IL ability. Extensive experimental results show that SimE surpasses traditional methods by 9.6% on TinyImageNet and outperforms other CLIP-based methods by 5.3% on CIFAR-100. Furthermore, we conduct a systematic study to enhance the utilization of the zero-shot capabilities of CLIP. We suggest replacing SimE's encoder with a CLIP model trained on larger datasets (e.g., LAION2B) and stronger architectures (e.g., ViT-L/14).
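The adapter placement described above can be illustrated with a minimal sketch. This is a hypothetical NumPy illustration, not SimE's actual implementation: `Adapter` is a standard bottleneck module (down-projection, ReLU, up-projection, residual add), and `forward_with_adapters` inserts one adapter *between* consecutive frozen transformer blocks, which is the connection type the abstract reports as beneficial. All dimensions and the identity stand-ins for CLIP blocks are assumptions for illustration only.

```python
import numpy as np

class Adapter:
    """Bottleneck adapter sketch: down-project, ReLU, up-project, residual add.

    Hypothetical illustration; the bottleneck width and initialization scale
    are assumptions, not SimE's reported configuration.
    """
    def __init__(self, dim: int, bottleneck: int, rng: np.random.Generator):
        self.w_down = rng.standard_normal((dim, bottleneck)) * 0.02
        self.w_up = rng.standard_normal((bottleneck, dim)) * 0.02

    def __call__(self, x: np.ndarray) -> np.ndarray:
        h = np.maximum(x @ self.w_down, 0.0)  # ReLU in the bottleneck
        # Residual connection: the frozen backbone's features pass through
        # unchanged, and the adapter only adds a small task-specific shift.
        return x + h @ self.w_up

def forward_with_adapters(x, blocks, adapters):
    # Adapters placed between transformer blocks (the inter-block connections
    # the abstract says improve IL performance); the blocks stay frozen and
    # only the adapter weights would be trained per incremental task.
    for block, adapter in zip(blocks, adapters):
        x = adapter(block(x))
    return x

rng = np.random.default_rng(0)
dim, bottleneck, depth = 8, 2, 3
blocks = [lambda x: x for _ in range(depth)]  # identity stand-ins for frozen CLIP blocks
adapters = [Adapter(dim, bottleneck, rng) for _ in range(depth)]

x = rng.standard_normal((4, dim))   # a batch of 4 token features
y = forward_with_adapters(x, blocks, adapters)
print(y.shape)  # (4, 8): adapters preserve the feature dimension
```

Because each adapter is residual and the backbone is frozen, adding a new task's adapters cannot overwrite previously learned ones, which is the usual motivation for adapter-based incremental learning.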