On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models

📅 2026-03-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses catastrophic forgetting in large vision-language models under continual learning, which the authors attribute primarily to routing drift: ambiguous or old-task samples become misrouted to newly added experts. To mitigate this, they propose LLaVA-DyMoE, a dynamic Mixture-of-Experts framework that identifies, for the first time, a "token's dilemma" at the token level. By analyzing routing score distributions, the method classifies token types and applies targeted regularization that stabilizes existing routing decisions while encouraging new experts to specialize. Combining dynamic expert expansion, token-level routing guidance, and expert-group separation, LLaVA-DyMoE improves mean final accuracy by over 7% and reduces forgetting by 12% on multi-task continual learning benchmarks, outperforming current state-of-the-art approaches.
📝 Abstract
Multimodal Continual Instruction Tuning aims to continually enhance Large Vision Language Models (LVLMs) by learning from new data without forgetting previously acquired knowledge. Mixture of Experts (MoE) architectures naturally facilitate this by incrementally adding new experts and expanding routers while keeping the existing ones frozen. However, despite expert isolation, MoE-based continual learners still suffer from forgetting due to routing-drift: old-task tokens become mistakenly attracted to newly added experts, degrading performance on prior tasks. We analyze the failure mode at the token level and reveal the token's dilemma: ambiguous and old tokens in new-task data offer minimal learning benefit yet induce forgetting when routed to new experts, due to their ambiguous routing assignment during training. Motivated by this, we propose LLaVA-DyMoE, a dynamic MoE framework that incrementally expands the MoE with drift-aware token assignment. We characterize token types via their routing score distributions and apply targeted regularization. Specifically, a token-level assignment guidance steers ambiguous and old tokens away from new experts to preserve established routing patterns and alleviate routing-drift, while complementary routing score regularizations enforce expert-group separation and promote new-expert specialization. Extensive experiments demonstrate that our LLaVA-DyMoE effectively mitigates routing-drift-induced forgetting, achieving over a 7% gain in mean final accuracy and a 12% reduction in forgetting compared to baselines. The project page is https://zhaoc5.github.io/DyMoE.
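The core idea in the abstract, steering ambiguous tokens away from newly added experts so that old routing patterns survive, can be illustrated with a minimal sketch. This is not the paper's implementation: the entropy-based ambiguity criterion, the threshold, and the hard masking of new-expert logits are all simplifying assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def drift_aware_route(tokens, old_expert_w, new_expert_w, entropy_thresh=1.0):
    """Route tokens to experts, guiding ambiguous tokens away from new experts.

    tokens:       (n_tokens, d) input token embeddings
    old_expert_w: (d, n_old)    frozen router weights for existing experts
    new_expert_w: (d, n_new)    router weights for newly added experts
    """
    # Routing logits over the combined (old + new) expert pool.
    logits = tokens @ np.concatenate([old_expert_w, new_expert_w], axis=1)
    scores = softmax(logits)

    # Ambiguity proxy (an assumption): high entropy of the routing-score
    # distribution means the router has no clear expert preference.
    entropy = -(scores * np.log(scores + 1e-9)).sum(axis=1)
    ambiguous = entropy > entropy_thresh

    # Token-level assignment guidance: mask new-expert logits for ambiguous
    # tokens so they keep following established (old-expert) routing patterns.
    n_old = old_expert_w.shape[1]
    guided = logits.copy()
    guided[ambiguous, n_old:] = -np.inf

    return softmax(guided).argmax(axis=1)
```

In the paper this guidance is applied as a soft regularization during training rather than a hard mask at inference; the sketch only shows the direction of the intervention, not its exact form.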
Problem

Research questions and friction points this paper is trying to address.

Continual Learning
Mixture of Experts
Routing Drift
Vision Language Models
Catastrophic Forgetting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic MoE
Routing-Drift Mitigation
Token-Level Assignment
Continual Learning
Expert Specialization