🤖 AI Summary
Vision-Language Models (VLMs) suffer severe degradation in zero-shot classification performance under incremental learning, especially when downstream tasks diverge significantly from pretraining domains. To address this, the authors propose MoDular Embedding Recomposition (MoDER), a framework that goes beyond preserving zero-shot capability toward actively enhancing it via retrieval-augmented prototype synthesis. MoDER builds a composable hub of lightweight textual expert modules, each specialized in a single seen class; instead of fine-tuning the backbone, it retrieves and composes these experts on demand at inference time to synthesize refined prototypes for unseen classes, enabling cross-task knowledge reuse. Crucially, MoDER is parameter-efficient (only the experts are trained) and compatible with both the Class-Incremental Learning (Class-IL) and Multi-Task Incremental Learning (MTIL) protocols. Evaluated across 14 datasets spanning these two protocols, MoDER consistently improves zero-shot accuracy while retaining the model's original generalization capacity. The implementation is publicly available.
📝 Abstract
The advent of pre-trained Vision-Language Models (VLMs) has significantly transformed Continual Learning (CL), mainly due to their zero-shot classification abilities. Such proficiency makes VLMs well-suited for real-world applications, enabling robust performance on novel unseen classes without requiring adaptation. However, fine-tuning remains essential when downstream tasks deviate significantly from the pre-training domain. Prior CL approaches primarily focus on preserving the zero-shot capabilities of VLMs during incremental fine-tuning on a downstream task. We take a step further by devising an approach that transforms preservation into enhancement of the zero-shot capabilities of VLMs. Our approach, named MoDular Embedding Recomposition (MoDER), introduces a modular framework that trains multiple textual experts, each specialized in a single seen class, and stores them in a foundational hub. At inference time, for each unseen class, we query the hub and compose the retrieved experts to synthesize a refined prototype that improves classification. We show the effectiveness of our method across two popular zero-shot incremental protocols, Class-IL and MTIL, comprising a total of 14 datasets. The codebase is available at https://github.com/aimagelab/mammoth.
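The retrieve-and-compose mechanism described above can be illustrated with a minimal sketch. Everything here is an assumption for exposition, not the authors' implementation: the hub is a plain dictionary of per-class expert embeddings, retrieval is top-k cosine similarity against the unseen class's frozen zero-shot text embedding, and composition is a simple normalized blend of the retrieved experts with that embedding.

```python
import numpy as np

# Illustrative sketch only: `expert_hub`, `synthesize_prototype`, top-k
# retrieval, and mean-blend composition are all hypothetical choices,
# not MoDER's actual training or composition procedure.

rng = np.random.default_rng(0)
DIM = 8  # toy embedding dimension


def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


# Hub: one trained textual expert embedding per seen class.
expert_hub = {
    name: normalize(rng.standard_normal(DIM))
    for name in ["dog", "cat", "truck", "airplane"]
}


def synthesize_prototype(query_embedding, hub, k=2, alpha=0.5):
    """Retrieve the k experts most similar to the unseen class's
    zero-shot embedding and blend them into a refined prototype."""
    experts = np.stack(list(hub.values()))      # (num_experts, DIM)
    sims = experts @ query_embedding            # cosine similarity (unit vectors)
    top = np.argsort(sims)[-k:]                 # indices of the top-k experts
    composed = normalize(experts[top].mean(axis=0))
    # Blend the frozen zero-shot prototype with the composed experts.
    return normalize(alpha * query_embedding + (1 - alpha) * composed)


# Unseen class: start from its frozen zero-shot text embedding.
zero_shot = normalize(rng.standard_normal(DIM))
refined = synthesize_prototype(zero_shot, expert_hub)
```

The refined prototype then replaces the plain zero-shot text embedding when scoring image features for that class; the backbone itself is never updated.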