🤖 AI Summary
To address catastrophic forgetting and task interference in continual learning of large models, this paper proposes a rank-level fine-grained adaptive Mixture-of-Experts (MoE) framework. Unlike conventional LoRA-MoE approaches employing coarse-grained expert selection, our method decomposes low-rank updates into multiple independent rank-1 experts and integrates self-activating sparsity with routing inference grounded in intermediate-layer activations, enabling input-driven dynamic sparse selection. Rank pruning, activation budget control, and self-assessing routing jointly mitigate subspace interference, redundant parameter updates, and routing ambiguity. Experiments on CLIP and large language models demonstrate that the method significantly reduces forgetting rates while enhancing both forward and backward transfer capabilities. Moreover, routing stability is preserved as the number of experts scales up.
📝 Abstract
Continual learning (CL) with large pre-trained models is challenged by catastrophic forgetting and task interference. Existing LoRA-based Mixture-of-Experts (MoE) approaches mitigate forgetting by assigning and freezing task-specific adapters, but their coarse adapter-level selection introduces three key challenges: 1) Interference: activating full LoRA experts per input causes subspace interference and prevents selective reuse of useful components across tasks. 2) Redundancy: newly added experts often duplicate or contradict existing knowledge, unnecessarily activating unrelated ranks while insufficiently reusing relevant ones. 3) Ambiguity: overlapping features across tasks confuse the router, yielding unstable expert assignments; as more experts accumulate, routing for earlier tasks degrades, accelerating forgetting. We propose MoRA, a Mixture-of-Rank Adaptive learning approach with self-activated, sparse rank activation for CL. Rather than mixing multiple low-rank matrices, MoRA decomposes each rank-r update into r rank-1 components, each treated as an independent expert, enabling a fine-grained mixture of rank-1 experts while mitigating interference and redundancy. To avoid ambiguous routing, each rank-1 expert infers its own relevance from intermediate activations. Coupled with our proposed rank pruning and activation budgets, MoRA adaptively selects a sparse mixture of ranks per input. We validate MoRA on continual learning tasks with CLIP and large language models (LLMs), analyzing both in-domain learning and out-of-domain forgetting and generalization during fine-tuning. MoRA substantially enhances CL with pre-trained models (PTMs), improving generalization while mitigating forgetting.
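The rank-1 decomposition at the core of MoRA can be sketched in a few lines: a rank-r LoRA update W = BA splits exactly into r rank-1 outer products, and a sparse subset of them can be selected per input. The NumPy sketch below is illustrative only; scoring experts by the magnitude of their intermediate activations A_i·x, and keeping the top-k under a budget, is an assumed stand-in for the paper's self-activated routing, not its actual rule.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 16, 4

# A rank-r LoRA update W = B @ A (B: d_out x r, A: r x d_in).
B = rng.normal(size=(d_out, r))
A = rng.normal(size=(r, d_in))
x = rng.normal(size=d_in)

# Output of the full low-rank update.
full = B @ A @ x

# Decompose into r rank-1 "experts": expert i contributes
# outer(B[:, i], A[i]) @ x == B[:, i] * (A[i] @ x).
per_rank = np.stack([np.outer(B[:, i], A[i]) @ x for i in range(r)])
assert np.allclose(per_rank.sum(axis=0), full)  # exact decomposition

# Illustrative sparse selection under an activation budget k:
# score each rank-1 expert by its intermediate activation |A[i] @ x|
# and keep only the top-k (a hypothetical proxy for MoRA's routing).
k = 2
scores = np.abs(A @ x)
keep = np.argsort(scores)[-k:]
sparse_out = per_rank[keep].sum(axis=0)
```

Because the decomposition is exact, pruning or gating individual ranks changes only which subspaces of the update are applied per input, which is what lets components be reused or skipped at a finer granularity than whole-adapter selection.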