Split-on-Share: Mixture of Sparse Experts for Task-Agnostic Continual Learning

📅 2026-01-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the plasticity-stability dilemma in continual learning for large language models, where acquiring new tasks often causes catastrophic forgetting of previously learned knowledge. The authors propose SETA, a framework that, for the first time in task-agnostic continual learning, explicitly decouples task-specific and shared knowledge. SETA employs a sparse mixture-of-experts (MoE) architecture to partition parameters and introduces a unified gating mechanism to dynamically compose experts. Furthermore, it integrates elastic weight anchoring (an elastic-weight-consolidation-style regularizer) with parameter-efficient fine-tuning to safeguard critical shared parameters. Experimental results demonstrate that SETA significantly outperforms existing parameter-efficient continual learning methods across multiple general and domain-specific benchmarks.
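The architecture described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: all dimensions, expert counts, and the top-k routing rule are assumed for the example; shared experts are always active, while a unified gate sparsely selects among unique (task-specific) experts.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_shared, n_unique, top_k = 8, 2, 4, 2

# Illustrative random parameters; a real model would learn these.
shared_W = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(n_shared)]
unique_W = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(n_unique)]
gate_W = rng.standard_normal((dim, n_unique)) / np.sqrt(dim)

def moe_layer(x):
    """Shared experts always fire (common features); the gate sparsely
    routes to the top-k unique experts (task-specific patterns), so
    tasks do not compete for the same parameters."""
    out = sum(x @ W for W in shared_W)          # shared knowledge path
    logits = x @ gate_W
    probs = np.exp(logits - logits.max())       # softmax over unique experts
    probs /= probs.sum()
    top = np.argsort(probs)[-top_k:]            # sparse top-k selection
    for j in top:
        out += probs[j] * (x @ unique_W[j])     # weighted unique-expert output
    return out

y = moe_layer(rng.standard_normal(dim))
print(y.shape)  # (8,)
```

Because routing is input-conditioned rather than keyed on a task identifier, the same gate can serve at inference without being told which task an input belongs to, which is the task-agnostic property the summary highlights.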

📝 Abstract
Continual learning in Large Language Models (LLMs) is hindered by the plasticity-stability dilemma, where acquiring new capabilities often leads to catastrophic forgetting of previous knowledge. Existing methods typically treat parameters uniformly, failing to distinguish between specific task knowledge and shared capabilities. We introduce Mixture of Sparse Experts for Task-Agnostic Continual Learning, referred to as SETA, a framework that resolves the plasticity-stability conflict by decomposing the model into modular subspaces. Unlike standard updates, where tasks compete for the same parameters, SETA separates knowledge into unique experts, designed to isolate task-specific patterns, and shared experts, responsible for capturing common features. This structure is maintained through elastic weight anchoring, which protects critical shared knowledge and enables a unified gating network to automatically retrieve the correct expert combination for each task during inference. Extensive experiments across diverse domain-specific and general benchmarks demonstrate that SETA consistently outperforms state-of-the-art parameter-efficient fine-tuning-based continual learning methods.
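The "elastic weight anchoring" mentioned in the abstract can be illustrated with a quadratic penalty in the style of elastic weight consolidation: parameter drift away from anchored values is penalized in proportion to an importance estimate. The function name, the Fisher-diagonal importance weights, and all values below are assumptions for the sketch, not taken from the paper.

```python
import numpy as np

def anchoring_penalty(theta, theta_anchor, importance, lam=1.0):
    """EWC-style quadratic penalty (illustrative): shared parameters
    judged critical (high importance) are held close to their
    anchored values, preserving previously learned knowledge."""
    return 0.5 * lam * float(np.sum(importance * (theta - theta_anchor) ** 2))

theta_old = np.array([1.0, 2.0, 3.0])   # anchored shared parameters
importance = np.array([10.0, 0.1, 1.0]) # high value = critical weight
theta_new = np.array([1.5, 2.5, 3.5])   # after a new-task update

print(anchoring_penalty(theta_new, theta_old, importance))  # 1.3875
```

Adding this term to the new-task loss is what lets plasticity (free movement of low-importance and unique-expert weights) coexist with stability (anchored high-importance shared weights).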
Problem

Research questions and friction points this paper is trying to address.

continual learning
catastrophic forgetting
plasticity-stability dilemma
task-specific knowledge
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Sparse Experts
Continual Learning
Task-Agnostic
Elastic Weight Anchoring
Parameter-Efficient Fine-Tuning