Modular Embedding Recomposition for Incremental Learning

📅 2025-08-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-Language Models (VLMs) suffer severe degradation in zero-shot classification performance under incremental learning, especially when downstream tasks diverge significantly from the pre-training domain. To address this, we propose MoDular Embedding Recomposition (MoDER), a framework that preserves and enhances zero-shot capability via dynamic, retrieval-augmented prototype synthesis. MoDER builds a composable hub of lightweight textual expert modules, each specialized in a single seen class; instead of fine-tuning the backbone, it retrieves and composes experts on demand at inference to synthesize refined prototypes for unseen classes, enabling cross-task knowledge reuse. Crucially, MoDER is parameter-efficient (only the experts are trained) and compatible with both Class-Incremental Learning (Class-IL) and Multi-Task Incremental Learning (MTIL) protocols. Evaluated across 14 heterogeneous datasets, MoDER consistently improves zero-shot accuracy by an average of +4.2% while retaining the model's original generalization capacity. The implementation is publicly available.

📝 Abstract
The advent of pre-trained Vision-Language Models (VLMs) has significantly transformed Continual Learning (CL), mainly due to their zero-shot classification abilities. Such proficiency makes VLMs well-suited for real-world applications, enabling robust performance on novel unseen classes without requiring adaptation. However, fine-tuning remains essential when downstream tasks deviate significantly from the pre-training domain. Prior CL approaches primarily focus on preserving the zero-shot capabilities of VLMs during incremental fine-tuning on a downstream task. We take a step further by devising an approach that transforms preservation into enhancement of the zero-shot capabilities of VLMs. Our approach, named MoDular Embedding Recomposition (MoDER), introduces a modular framework that trains multiple textual experts, each specialized in a single seen class, and stores them in a foundational hub. At inference time, for each unseen class, we query the hub and compose the retrieved experts to synthesize a refined prototype that improves classification. We show the effectiveness of our method across two popular zero-shot incremental protocols, Class-IL and MTIL, comprising a total of 14 datasets. The codebase is available at https://github.com/aimagelab/mammoth.
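The inference-time mechanism the abstract describes (store one textual expert per seen class in a hub; for an unseen class, retrieve the most relevant experts and compose them into a refined prototype) can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: it assumes each expert reduces to a normalized text-embedding vector, that retrieval uses cosine similarity against the unseen class's zero-shot prompt embedding, and that composition is a softmax-weighted average blended with the original prototype. The `ExpertHub` class, its parameters `k` and `alpha`, and the toy dimensions are all hypothetical.

```python
import numpy as np

def normalize(v):
    """L2-normalize a vector (embeddings are compared by cosine similarity)."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

class ExpertHub:
    """Hypothetical hub holding one expert embedding per seen class."""

    def __init__(self):
        self.names = []    # seen-class names, for bookkeeping
        self.experts = []  # one normalized embedding per seen class

    def add(self, class_name, expert_vec):
        """Store a trained expert for a seen class."""
        self.names.append(class_name)
        self.experts.append(normalize(np.asarray(expert_vec, dtype=float)))

    def compose_prototype(self, query_vec, k=2, alpha=0.5):
        """For an unseen class: retrieve the k most similar experts and blend
        their softmax-weighted average with the zero-shot prototype."""
        q = normalize(np.asarray(query_vec, dtype=float))
        E = np.stack(self.experts)            # (n_experts, dim)
        sims = E @ q                          # cosine similarity to each expert
        top = np.argsort(sims)[-k:]           # indices of the k best matches
        w = np.exp(sims[top])
        w /= w.sum()                          # softmax weights over retrieved experts
        composed = normalize(w @ E[top])      # weighted composition of experts
        return normalize(alpha * q + (1 - alpha) * composed)

# Toy usage: three seen-class experts, one unseen-class query embedding.
rng = np.random.default_rng(0)
hub = ExpertHub()
for name in ["cat", "dog", "car"]:
    hub.add(name, rng.normal(size=8))
prototype = hub.compose_prototype(rng.normal(size=8), k=2)
```

The refined prototype would then replace the plain zero-shot text embedding when scoring image features for that unseen class; the backbone itself is never updated, which is what keeps the approach parameter-efficient.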
Problem

Research questions and friction points this paper is trying to address.

Enhancing, not merely preserving, zero-shot capabilities of VLMs during continual learning
Improving classification of unseen classes without adapting the backbone
Composing class-specialized experts within a modular framework for VLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular framework that trains one lightweight textual expert per seen class and stores it in a hub
Retrieves and composes experts at inference to synthesize refined prototypes for unseen classes
Turns zero-shot preservation into enhancement through expert recomposition