🤖 AI Summary
To address catastrophic forgetting and task interference in continual learning of large models, this paper proposes a rank-level fine-grained adaptive Mixture-of-Experts (MoE) framework. Unlike conventional LoRA-MoE approaches employing coarse-grained expert selection, our method decomposes low-rank updates into multiple independent rank-1 experts and integrates self-activating sparsity with routing inference grounded in intermediate-layer activations, enabling input-driven dynamic sparse selection. Rank pruning, activation budget control, and self-assessing routing jointly mitigate subspace interference, redundant parameter updates, and routing ambiguity. Experiments on CLIP and large language models demonstrate that the method significantly reduces forgetting rates while enhancing both forward and backward transfer capabilities. Moreover, routing stability is preserved as the number of experts scales up.
📝 Abstract
Continual learning (CL) with large pre-trained models is challenged by catastrophic forgetting and task interference. Existing LoRA-based Mixture-of-Experts (MoE) approaches mitigate forgetting by assigning and freezing task-specific adapters, but their coarse adapter-level selection introduces three key challenges: 1) Interference: activating full LoRA experts per input causes subspace interference and prevents selective reuse of useful components across tasks. 2) Redundancy: newly added experts often duplicate or contradict existing knowledge, unnecessarily activating unrelated ranks while insufficiently reusing relevant ones. 3) Ambiguity: overlapping features across tasks confuse the router, yielding unstable expert assignments; as more experts accumulate, routing for earlier tasks degrades, accelerating forgetting. We propose MoRA, a Mixture-of-Rank Adaptive learning approach with self-activated, sparse rank activation for CL. Rather than mixing multiple low-rank matrices, MoRA decomposes each rank-r update into r rank-1 components, each treated as an independent expert, enabling a fine-grained mixture of rank-1 experts while mitigating interference and redundancy. To avoid ambiguous routing, each rank-1 expert infers its own relevance from intermediate activations. Coupled with our proposed rank pruning and activation budgets, MoRA adaptively selects a sparse mixture of ranks per input. We validate MoRA on continual learning tasks with CLIP and large language models (LLMs), analyzing both in-domain learning and out-of-domain forgetting and generalization during fine-tuning. MoRA substantially enhances CL with pre-trained models (PTMs), improving generalization while mitigating forgetting.
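The rank-1 decomposition at the core of MoRA can be sketched in a few lines: a rank-r LoRA update W = BA splits exactly into r rank-1 outer products, and a sparse subset of them can be selected per input. The NumPy sketch below is illustrative only; scoring experts by the magnitude of their intermediate activations A_i·x, and keeping the top-k under a budget, is an assumed stand-in for the paper's self-activated routing, not its actual rule.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 16, 4

# A rank-r LoRA update W = B @ A (B: d_out x r, A: r x d_in).
B = rng.normal(size=(d_out, r))
A = rng.normal(size=(r, d_in))
x = rng.normal(size=d_in)

# Output of the full low-rank update.
full = B @ A @ x

# Decompose into r rank-1 "experts": expert i contributes
# outer(B[:, i], A[i]) @ x == B[:, i] * (A[i] @ x).
per_rank = np.stack([np.outer(B[:, i], A[i]) @ x for i in range(r)])
assert np.allclose(per_rank.sum(axis=0), full)  # exact decomposition

# Illustrative sparse selection under an activation budget k:
# score each rank-1 expert by its intermediate activation |A[i] @ x|
# and keep only the top-k (a hypothetical proxy for MoRA's routing).
k = 2
scores = np.abs(A @ x)
keep = np.argsort(scores)[-k:]
sparse_out = per_rank[keep].sum(axis=0)
```

Because the decomposition is exact, pruning or gating individual ranks changes only which subspaces of the update are applied per input, which is what lets components be reused or skipped at a finer granularity than whole-adapter selection.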