Similarity-Aware Mixture-of-Experts for Data-Efficient Continual Learning

📅 2026-03-24
🤖 AI Summary
This work addresses the challenges of inefficient knowledge transfer and negative transfer in continual learning scenarios characterized by scarce data and arbitrarily overlapping tasks. To this end, the authors propose an adaptive mixture-of-experts framework built upon pretrained models. The approach employs incremental global pooling to reduce noise in prompt associations and introduces instance-level prompt masking to dynamically distinguish in-distribution from out-of-distribution samples. Furthermore, it incrementally constructs a task-similarity-aware mechanism to guide the dynamic expansion of the prompt space and facilitate effective knowledge reuse. Experimental results demonstrate that the proposed method significantly improves sample efficiency across varying data scales and task similarity settings, while maintaining strong generalization performance and training stability.

📝 Abstract
Machine learning models often need to adapt to new data after deployment due to structured or unstructured real-world dynamics. The Continual Learning (CL) framework enables continuous model adaptation, but most existing approaches assume either that each task contains sufficiently many data samples or that the learning tasks are non-overlapping. In this paper, we address the more general setting where each task may have a limited dataset, and tasks may overlap in an arbitrary manner without a priori knowledge. This general setting is substantially more challenging for two reasons. On the one hand, data scarcity necessitates effective contextualization of general knowledge and efficient knowledge transfer across tasks. On the other hand, unstructured task overlapping can easily result in negative knowledge transfer. To address the above challenges, we propose an adaptive mixture-of-experts (MoE) framework over pre-trained models that progressively establishes similarity awareness among tasks. Our design contains two innovative algorithmic components: incremental global pooling and instance-wise prompt masking. The former mitigates prompt association noise through gradual prompt introduction over time. The latter decomposes incoming task samples into those aligning with current prompts (in-distribution) and those requiring new prompts (out-of-distribution). Together, our design strategically leverages potential task overlaps while actively preventing negative mutual interference in the presence of per-task data scarcity. Experiments across varying data volumes and inter-task similarity show that our method enhances sample efficiency and is broadly applicable.
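The instance-wise prompt masking described above can be illustrated with a toy sketch: score each incoming instance against the current pool of prompt keys, route in-distribution instances to their nearest prompt, and flag low-scoring instances as out-of-distribution candidates for a new prompt. All names, the cosine-similarity scoring, and the fixed threshold here are illustrative assumptions, not the paper's actual API or decision rule.

```python
import numpy as np

def route_instances(features, prompt_keys, ood_threshold=0.5):
    """Toy sketch of instance-wise prompt masking (illustrative only).

    Compares each instance embedding to every prompt key via cosine
    similarity. Instances whose best match falls below `ood_threshold`
    are masked as out-of-distribution (they would trigger allocation of
    a new prompt); the rest reuse their nearest existing prompt.
    """
    # Normalize rows so plain dot products become cosine similarities.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    k = prompt_keys / np.linalg.norm(prompt_keys, axis=1, keepdims=True)
    sims = f @ k.T                  # shape: (n_instances, n_prompts)
    best = sims.max(axis=1)         # best-matching prompt score per instance
    assign = sims.argmax(axis=1)    # index of the nearest prompt
    is_ood = best < ood_threshold   # mask: instance needs a new prompt
    return assign, is_ood

# Usage: one instance aligned with the sole prompt key, one orthogonal to it.
features = np.array([[1.0, 0.0], [0.0, 1.0]])
prompt_keys = np.array([[1.0, 0.0]])
assign, is_ood = route_instances(features, prompt_keys)
```

With these inputs the first instance reuses prompt 0 while the second, having zero cosine similarity to every key, is masked as out-of-distribution.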
Problem

Research questions and friction points this paper is trying to address.

Continual Learning
Data-Efficient
Task Overlap
Negative Transfer
Limited Data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Similarity-Aware Mixture-of-Experts
Continual Learning
Prompt Masking
Incremental Global Pooling
Data-Efficient Learning
Connor Mclaughlin
Northeastern University
Nigel Lee
The Charles Stark Draper Laboratory, Inc.
Lili Su
Assistant Professor, Northeastern University
Distributed learning, machine learning, fault/adversary-tolerant computing, performance evaluation