LLaVA-CMoE: Towards Continual Mixture of Experts for Large Vision-Language Models

📅 2025-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address parameter explosion and catastrophic forgetting in Mixture-of-Experts (MoE) architectures for continual learning of large vision-language models, this paper proposes a replay-free continual MoE framework. The method introduces two core innovations: (1) a Probe-Guided Knowledge Extension (PGKE) mechanism that dynamically expands expert modules on-demand—without full-parameter growth—and (2) a Probabilistic Task Locator (PTL), a hierarchical routing algorithm that models task probabilities to decouple novel and historical knowledge, thereby preserving the functionality of pretrained experts. Crucially, the approach requires no storage of past task data. Evaluated on the Coin benchmark, it achieves significant gains in continual learning performance while maintaining controlled parameter growth. The framework effectively mitigates forgetting and improves parameter utilization efficiency, demonstrating strong scalability and knowledge retention in sequential multimodal learning settings.

📝 Abstract
Although applying Mixture of Experts to large language models for learning new tasks is widely regarded as an effective strategy for continual learning, two major challenges remain: (1) As the number of tasks grows, simple parameter expansion strategies can lead to excessively large models. (2) Modifying the parameters of the existing router results in the erosion of previously acquired knowledge. In this paper, we present an innovative framework named LLaVA-CMoE, which is a continual Mixture of Experts (MoE) architecture without any replay data. Specifically, we have developed a method called Probe-Guided Knowledge Extension (PGKE), which employs probe experts to assess whether additional knowledge is required for a specific layer. This approach enables the model to adaptively expand its network parameters based on task distribution, thereby significantly improving the efficiency of parameter expansion. Additionally, we introduce a hierarchical routing algorithm called Probabilistic Task Locator (PTL), where high-level routing captures inter-task information and low-level routing focuses on intra-task details, ensuring that new task experts do not interfere with existing ones. Our experiments show that our efficient architecture substantially improves model performance on the Coin benchmark while maintaining a reasonable parameter count.
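The probe-expert idea behind PGKE can be sketched as a per-layer decision rule: during a brief probing phase on a new task, a lightweight probe expert competes with the frozen experts for router weight, and the layer is expanded only if the probe attracts enough of it. The class name, threshold, and promotion criterion below are illustrative assumptions, not the paper's exact procedure.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class ProbeGuidedLayer:
    """Sketch of probe-guided expansion for one MoE layer.

    A probe expert is appended to the frozen experts during a short
    probing phase on the new task. If the router's average weight on
    the probe exceeds a threshold, the layer is judged to need new
    capacity and the probe would be promoted to a full expert;
    otherwise the layer is left unchanged. (Hypothetical rule for
    illustration only.)
    """

    def __init__(self, num_experts, threshold=0.3):
        self.num_experts = num_experts
        self.threshold = threshold

    def needs_expansion(self, router_logits_with_probe):
        # router_logits_with_probe: logits over [old experts..., probe],
        # averaged over a batch of new-task tokens.
        probs = softmax(router_logits_with_probe)
        return probs[-1] > self.threshold
```

Because expansion is decided layer by layer, only the layers where the new task actually demands extra knowledge grow, which is how the method avoids full-parameter growth across the network.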
Problem

Research questions and friction points this paper is trying to address.

Prevents excessive model growth with increasing tasks
Avoids knowledge erosion in existing router parameters
Enhances parameter expansion efficiency adaptively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Probe-Guided Knowledge Extension for adaptive parameter expansion
Hierarchical Probabilistic Task Locator routing algorithm
Continuous Mixture of Experts without replay data
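The hierarchical routing in PTL can be sketched as a two-level softmax: a high-level router assigns a probability to each task's expert group, and a low-level router distributes weight among the experts inside each group, so the weight on expert j of task t is p(task = t) x p(expert = j | task = t). Experts belonging to frozen past tasks are reached only through their own group, which keeps new-task training from re-weighting them. This is a simplified sketch; the exact parameterization of the paper's task-probability model is assumed here.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def hierarchical_route(task_logits, expert_logits_per_task):
    """Two-level routing sketch in the spirit of PTL.

    task_logits: high-level router logits, one per task group.
    expert_logits_per_task: low-level router logits for the experts
    inside each task group.

    Returns weights[t][j] = p(task=t) * p(expert=j | task=t), which
    sum to 1 over all (t, j) pairs. (Illustrative factorization, not
    the paper's exact formulation.)
    """
    task_probs = softmax(task_logits)
    weights = []
    for t, logits in enumerate(expert_logits_per_task):
        inner = softmax(logits)
        weights.append([task_probs[t] * p for p in inner])
    return weights
```

For example, if the high-level router is confident the input belongs to task 0, nearly all routing mass lands inside task 0's expert group, and the experts of other (possibly frozen) tasks receive almost none.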