🤖 AI Summary
Vision-Language Models (VLMs) suffer severe degradation in zero-shot classification performance under incremental learning, especially when downstream tasks diverge significantly from pretraining domains. To address this, the authors propose MoDular Embedding Recomposition (MoDER), a framework that goes beyond preserving zero-shot capability toward actively enhancing it via retrieval-augmented prototype synthesis. MoDER builds a composable hub of lightweight textual expert modules, each specialized in a single seen class; instead of fine-tuning the backbone, it retrieves and composes these experts on demand at inference time to synthesize refined prototypes for unseen classes, enabling cross-task knowledge reuse. Crucially, MoDER is parameter-efficient (only the experts are trained) and compatible with both the Class-Incremental Learning (Class-IL) and Multi-Task Incremental Learning (MTIL) protocols. Evaluated across 14 datasets spanning these two protocols, MoDER consistently improves zero-shot accuracy while retaining the model's original generalization capacity. The implementation is publicly available.
📝 Abstract
The advent of pre-trained Vision-Language Models (VLMs) has significantly transformed Continual Learning (CL), mainly due to their zero-shot classification abilities. Such proficiency makes VLMs well-suited for real-world applications, enabling robust performance on novel unseen classes without requiring adaptation. However, fine-tuning remains essential when downstream tasks deviate significantly from the pre-training domain. Prior CL approaches primarily focus on preserving the zero-shot capabilities of VLMs during incremental fine-tuning on a downstream task. We take a step further by devising an approach that transforms preservation into enhancement of the zero-shot capabilities of VLMs. Our approach, named MoDular Embedding Recomposition (MoDER), introduces a modular framework that trains multiple textual experts, each specialized in a single seen class, and stores them in a foundational hub. At inference time, for each unseen class, we query the hub and compose the retrieved experts to synthesize a refined prototype that improves classification. We show the effectiveness of our method across two popular zero-shot incremental protocols, Class-IL and MTIL, comprising a total of 14 datasets. The codebase is available at https://github.com/aimagelab/mammoth.
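The retrieve-and-compose mechanism described above can be illustrated with a minimal sketch. Everything here is an assumption for exposition, not the authors' implementation: the hub is a plain dictionary of per-class expert embeddings, retrieval is top-k cosine similarity against the unseen class's frozen zero-shot text embedding, and composition is a simple normalized blend of the retrieved experts with that embedding.

```python
import numpy as np

# Illustrative sketch only: `expert_hub`, `synthesize_prototype`, top-k
# retrieval, and mean-blend composition are all hypothetical choices,
# not MoDER's actual training or composition procedure.

rng = np.random.default_rng(0)
DIM = 8  # toy embedding dimension


def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


# Hub: one trained textual expert embedding per seen class.
expert_hub = {
    name: normalize(rng.standard_normal(DIM))
    for name in ["dog", "cat", "truck", "airplane"]
}


def synthesize_prototype(query_embedding, hub, k=2, alpha=0.5):
    """Retrieve the k experts most similar to the unseen class's
    zero-shot embedding and blend them into a refined prototype."""
    experts = np.stack(list(hub.values()))      # (num_experts, DIM)
    sims = experts @ query_embedding            # cosine similarity (unit vectors)
    top = np.argsort(sims)[-k:]                 # indices of the top-k experts
    composed = normalize(experts[top].mean(axis=0))
    # Blend the frozen zero-shot prototype with the composed experts.
    return normalize(alpha * query_embedding + (1 - alpha) * composed)


# Unseen class: start from its frozen zero-shot text embedding.
zero_shot = normalize(rng.standard_normal(DIM))
refined = synthesize_prototype(zero_shot, expert_hub)
```

The refined prototype then replaces the plain zero-shot text embedding when scoring image features for that class; the backbone itself is never updated.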