SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance degradation of multimodal large language models under continual instruction tuning, which the authors attribute to two failure modes: router drift and expert drift. To mitigate these issues, they propose a stabilized mixture-of-experts mechanism that improves training stability and generalization. Specifically, orthogonal subspace routing updates alleviate routing inconsistency, while a curvature-aware expert update, guided by the historical input covariance, constrains how far each expert's function drifts during optimization. An adaptive expert freezing strategy further reduces cross-task interference. By integrating sparse routing, orthogonal decomposition, curvature-aware scaling, and replay-free continual learning, the proposed method achieves state-of-the-art performance on multimodal continual instruction tuning benchmarks, improving both stability and generalization.
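To make the routing-stabilization idea concrete, here is a minimal sketch of an orthogonal-subspace update for router weights. All names are illustrative assumptions, not the paper's implementation: the raw gradient is projected onto the complement of the subspace spanned by earlier tasks' routing inputs, so the router only moves along directions those tasks did not use.

```python
import numpy as np

def orthogonal_update(grad, old_basis):
    """Project grad (d x k) onto the orthogonal complement of
    span(old_basis), where old_basis (d x r) has orthonormal columns
    summarizing previous tasks' routing inputs."""
    if old_basis is None or old_basis.shape[1] == 0:
        return grad
    # Subtracting B (B^T g) removes every component of the gradient
    # that lies inside the old tasks' subspace.
    return grad - old_basis @ (old_basis.T @ grad)

rng = np.random.default_rng(0)
d, k, r = 8, 4, 3
grad = rng.standard_normal((d, k))
# Orthonormal basis of the historical routing-input subspace, e.g. from
# a QR/SVD of stored activation statistics (hypothetical setup).
basis, _ = np.linalg.qr(rng.standard_normal((d, r)))
g_new = orthogonal_update(grad, basis)
# The projected update has no component in the old subspace.
print(np.abs(basis.T @ g_new).max() < 1e-10)  # prints True
```

Applying the router step with `g_new` instead of `grad` leaves expert selection for earlier tasks (numerically) unchanged, which is the routing-consistency property the summary describes.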

📝 Abstract
Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually expand their capabilities, making Multimodal Continual Instruction Tuning (MCIT) essential. Recent methods leverage sparse expert routing to promote task specialization, but we find that the expert routing process suffers from drift as the data distribution evolves. For example, a grounding query that previously activated localization experts may instead be routed to irrelevant experts after learning OCR tasks. Meanwhile, the grounding-related experts can be overwritten by new tasks and lose their original functionality. Such failures reflect two problems: router drift, where expert selection becomes inconsistent over time, and expert drift, where shared experts are overwritten across tasks. Therefore, we propose StAbilized Mixture-of-Experts (SAME) for MCIT. To address router drift, SAME stabilizes expert selection by decomposing routing dynamics into orthogonal subspaces and updating only task-relevant directions. To mitigate expert drift, we regulate expert updates via curvature-aware scaling using historical input covariance in a rehearsal-free manner. SAME also introduces adaptive expert activation to freeze selected experts during training, reducing redundant computation and cross-task interference. Extensive experiments demonstrate its SOTA performance.
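The curvature-aware, rehearsal-free expert regulation can be sketched as preconditioning each expert's gradient by the inverse of a running input covariance. This is a hedged illustration under assumed names, not the paper's exact update rule: directions heavily used by past tasks (high covariance) receive damped updates, while rarely used directions move almost freely, and only second-order statistics, never raw samples, are stored.

```python
import numpy as np

def curvature_scaled_update(W, grad, cov, lr=0.1, eps=1e-3):
    """Precondition grad (d x k) by the damped inverse of the running
    input covariance cov (d x d); eps keeps the inverse well-posed."""
    precond = np.linalg.inv(cov + eps * np.eye(cov.shape[0]))
    return W - lr * (precond @ grad)

rng = np.random.default_rng(1)
d, k = 6, 3
W = rng.standard_normal((d, k))
grad = rng.standard_normal((d, k))
# Historical input covariance cov = X X^T / n, accumulable online from
# past-task activations X (d x n), so no replay buffer is needed.
X = rng.standard_normal((d, 100))
cov = X @ X.T / X.shape[1]
W_new = curvature_scaled_update(W, grad, cov)
```

The design choice mirrors the abstract's claim: the covariance acts as a cheap curvature proxy, shrinking updates along well-covered input directions so earlier tasks' expert behavior is preserved.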
Problem

Research questions and friction points this paper is trying to address.

Multimodal Continual Instruction Tuning
Mixture-of-Experts
Router Drift
Expert Drift
Continual Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
Continual Learning
Router Drift
Expert Drift
Multimodal Instruction Tuning
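The adaptive expert activation listed above can be illustrated with a small sketch (threshold and names are hypothetical, not from the paper): experts whose average routing probability on the current task falls below a threshold are frozen, cutting redundant updates and cross-task interference.

```python
import numpy as np

def select_active_experts(route_probs, threshold=0.1):
    """route_probs: (n_tokens, n_experts) router softmax outputs for the
    current task. Returns a boolean mask; False marks experts to freeze
    (e.g. by disabling their gradients) for this task."""
    usage = route_probs.mean(axis=0)
    return usage >= threshold

# Two tokens routed over four experts; experts 2 and 3 are barely used.
probs = np.array([[0.7, 0.2, 0.05, 0.05],
                  [0.6, 0.3, 0.05, 0.05]])
mask = select_active_experts(probs)
print(mask.tolist())  # prints [True, True, False, False]
```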
Zhen-Hao Xie
State Key Laboratory for Novel Software Technology, Nanjing University, China
Jun-Tao Tang
State Key Laboratory for Novel Software Technology, Nanjing University, China
Yu-Cheng Shi
State Key Laboratory for Novel Software Technology, Nanjing University, China
Han-Jia Ye
Nanjing University
Machine Learning · Data Mining · Metric Learning · Meta-Learning
De-Chuan Zhan
Nanjing University, China
Machine Learning · Data Mining
Da-Wei Zhou
Associate Researcher, Nanjing University
Incremental Learning · Continual Learning · Open-World Learning · Model Reuse