LLaVA-CMoE: Towards Continual Mixture of Experts for Large Vision-Language Models

📅 2025-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address parameter explosion and catastrophic forgetting in Mixture-of-Experts (MoE) architectures for continual learning of large vision-language models, this paper proposes a replay-free continual MoE framework. The method introduces two core innovations: (1) a Probe-Guided Knowledge Extension (PGKE) mechanism that dynamically expands expert modules on-demand—without full-parameter growth—and (2) a Probabilistic Task Locator (PTL), a hierarchical routing algorithm that models task probabilities to decouple novel and historical knowledge, thereby preserving the functionality of pretrained experts. Crucially, the approach requires no storage of past task data. Evaluated on the Coin benchmark, it achieves significant gains in continual learning performance while maintaining controlled parameter growth. The framework effectively mitigates forgetting and improves parameter utilization efficiency, demonstrating strong scalability and knowledge retention in sequential multimodal learning settings.

📝 Abstract
Although applying Mixture of Experts to large language models for learning new tasks is widely regarded as an effective strategy for continual learning, two major challenges remain: (1) As the number of tasks grows, simple parameter expansion strategies can lead to excessively large models. (2) Modifying the parameters of the existing router results in the erosion of previously acquired knowledge. In this paper, we present an innovative framework named LLaVA-CMoE, which is a continual Mixture of Experts (MoE) architecture without any replay data. Specifically, we have developed a method called Probe-Guided Knowledge Extension (PGKE), which employs probe experts to assess whether additional knowledge is required for a specific layer. This approach enables the model to adaptively expand its network parameters based on task distribution, thereby significantly improving the efficiency of parameter expansion. Additionally, we introduce a hierarchical routing algorithm called Probabilistic Task Locator (PTL), where high-level routing captures inter-task information and low-level routing focuses on intra-task details, ensuring that new task experts do not interfere with existing ones. Our experiments show that our efficient architecture substantially improves model performance on the Coin benchmark while maintaining a reasonable parameter count.
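The probe-expert idea behind PGKE can be sketched as a per-layer decision rule: during a brief probing phase on a new task, a lightweight probe expert competes with the frozen experts for router weight, and the layer is expanded only if the probe attracts enough of it. The class name, threshold, and promotion criterion below are illustrative assumptions, not the paper's exact procedure.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class ProbeGuidedLayer:
    """Sketch of probe-guided expansion for one MoE layer.

    A probe expert is appended to the frozen experts during a short
    probing phase on the new task. If the router's average weight on
    the probe exceeds a threshold, the layer is judged to need new
    capacity and the probe would be promoted to a full expert;
    otherwise the layer is left unchanged. (Hypothetical rule for
    illustration only.)
    """

    def __init__(self, num_experts, threshold=0.3):
        self.num_experts = num_experts
        self.threshold = threshold

    def needs_expansion(self, router_logits_with_probe):
        # router_logits_with_probe: logits over [old experts..., probe],
        # averaged over a batch of new-task tokens.
        probs = softmax(router_logits_with_probe)
        return probs[-1] > self.threshold
```

Because expansion is decided layer by layer, only the layers where the new task actually demands extra knowledge grow, which is how the method avoids full-parameter growth across the network.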
Problem

Research questions and friction points this paper is trying to address.

Prevents excessive model growth with increasing tasks
Avoids knowledge erosion in existing router parameters
Enhances parameter expansion efficiency adaptively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Probe-Guided Knowledge Extension for adaptive parameter expansion
Hierarchical Probabilistic Task Locator routing algorithm
Continuous Mixture of Experts without replay data
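The hierarchical routing in PTL can be sketched as a two-level softmax: a high-level router assigns a probability to each task's expert group, and a low-level router distributes weight among the experts inside each group, so the weight on expert j of task t is p(task = t) x p(expert = j | task = t). Experts belonging to frozen past tasks are reached only through their own group, which keeps new-task training from re-weighting them. This is a simplified sketch; the exact parameterization of the paper's task-probability model is assumed here.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def hierarchical_route(task_logits, expert_logits_per_task):
    """Two-level routing sketch in the spirit of PTL.

    task_logits: high-level router logits, one per task group.
    expert_logits_per_task: low-level router logits for the experts
    inside each task group.

    Returns weights[t][j] = p(task=t) * p(expert=j | task=t), which
    sum to 1 over all (t, j) pairs. (Illustrative factorization, not
    the paper's exact formulation.)
    """
    task_probs = softmax(task_logits)
    weights = []
    for t, logits in enumerate(expert_logits_per_task):
        inner = softmax(logits)
        weights.append([task_probs[t] * p for p in inner])
    return weights
```

For example, if the high-level router is confident the input belongs to task 0, nearly all routing mass lands inside task 0's expert group, and the experts of other (possibly frozen) tasks receive almost none.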