CP-MoE: Consistency-Preserving Mixture-of-Experts for Continual Learning

📅 2026-05-18
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses catastrophic forgetting in continual learning for large language models and vision-language models by proposing CP-MoE, a framework that pairs a transient expert with stable experts. The transient expert captures early task-specific updates from a new task, while a consistency-preserving routing bias and a transient expert-guided regularization mechanism consolidate this new knowledge into the stable experts, balancing parameter protection against cross-task transfer. Built on a LoRA-based mixture-of-experts architecture with parameter merging, CP-MoE achieves state-of-the-art performance and stronger zero-shot transfer to unseen tasks on SuperNI, and consistently reduces forgetting on VQA v2, indicating that it scales to multimodal continual learning.
📝 Abstract
Catastrophic forgetting remains a major obstacle to continual learning in large language models (LLMs) and vision-language models (VLMs). Although Mixture-of-Experts (MoE) architectures offer an efficient path to scaling, existing LoRA-based MoE continual learning methods still face a fundamental trade-off: they either isolate experts too aggressively, limiting knowledge transfer across tasks, or allow task-specific updates to overwrite important existing parameters, leading to severe forgetting. To address this, we propose CP-MoE, a continual learning framework built around a transient expert that captures early task-specific updates and guides their integration into stable experts. CP-MoE introduces a consistency-preserving routing bias, which uses the transient expert to estimate representation similarity with stable experts and steer routing towards more compatible expert selection, and a transient expert-guided regularisation mechanism, which selectively protects important historical parameters during merging. Together, these components reduce parameter interference and forgetting while preserving cross-task knowledge transfer. We validate CP-MoE on both unimodal and multimodal continual learning benchmarks with LLM-based and VLM-based MoE models. On the SuperNI benchmark, which spans diverse sequential language tasks, CP-MoE achieves state-of-the-art performance and stronger zero-shot transfer to unseen tasks. On the VQA v2 dataset, it scales effectively to multimodal visual reasoning, consistently reduces forgetting, and outperforms strong MoE baselines.
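The abstract names two mechanisms without giving their formulas, so the following is only an illustrative sketch of the general idea, not the paper's method. It assumes cosine similarity as the representation-similarity measure, an additive bias on router logits, and a simple importance-based shrinkage during merging. The function names and the parameters alpha, lam and importance are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def biased_routing(router_logits, transient_out, stable_outs, alpha=1.0):
    """Assumed form of a similarity-based routing bias.

    Each stable expert's logit is raised in proportion to how similar its
    output is to the transient expert's output; alpha (hypothetical) sets
    the strength of the bias.
    """
    biased = [
        logit + alpha * cosine(transient_out, out)
        for logit, out in zip(router_logits, stable_outs)
    ]
    return softmax(biased)


def guarded_merge(stable, transient_delta, importance, lam=1.0):
    """Assumed form of an importance-guarded merge.

    The transient update is added to each stable parameter, but the step is
    shrunk where the (hypothetical) importance score is high, so parameters
    that mattered for earlier tasks move less.
    """
    return [
        w + d / (1.0 + lam * imp)
        for w, d, imp in zip(stable, transient_delta, importance)
    ]


# Two stable experts with equal router logits; the transient expert's output
# matches expert 0, so the bias shifts routing probability towards expert 0.
probs = biased_routing([0.0, 0.0], [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])

# The second parameter has high importance, so it receives a much smaller update.
merged = guarded_merge([1.0, 1.0], [0.5, 0.5], [0.0, 9.0])
```

In this toy setting the router would otherwise split evenly between the two experts. The bias is the only thing that favours the expert whose representation agrees with the transient expert. Likewise, the unimportant parameter takes the full update while the important one takes a tenth of it.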
Problem

Research questions and friction points this paper is trying to address.

catastrophic forgetting
continual learning
Mixture-of-Experts
large language models
vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
continual learning
catastrophic forgetting
consistency-preserving routing
transient expert