🤖 AI Summary
This work addresses the performance degradation of multimodal large language models under continual instruction tuning, which the authors attribute to two failure modes: router drift and expert drift. To mitigate these issues, they propose a stabilized mixture-of-experts mechanism, SAME, that improves training stability and generalization. Specifically, orthogonal subspace routing updates keep expert selection consistent as the data distribution shifts, while a curvature-aware expert update, guided by the historical input covariance, scales weight changes to preserve previously learned expert functionality. Additionally, an adaptive expert freezing strategy reduces cross-task interference and redundant computation. By integrating sparse routing, orthogonal decomposition, curvature-aware scaling, and rehearsal-free continual learning, the proposed method achieves state-of-the-art performance on multimodal continual instruction tuning benchmarks, significantly improving both stability and generalization.
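To make the "orthogonal subspace routing update" idea concrete, here is a minimal NumPy sketch of one plausible form of such an update: the router gradient is projected onto the complement of a basis spanning directions used by earlier tasks, so new updates cannot disturb earlier routing behavior. The function name, the learning rate, and the exact projection rule are illustrative assumptions, not the paper's published algorithm.

```python
import numpy as np

def orthogonal_router_update(grad, prev_basis, lr=0.1):
    """Sketch (assumed form, not the paper's exact rule): project a
    router-weight gradient onto the subspace orthogonal to directions
    spanned by earlier tasks' inputs, so the update only moves along
    task-relevant, non-interfering directions.

    grad:       (d,) gradient of the router weights for one expert
    prev_basis: (d, k) orthonormal basis of earlier tasks' subspace
    """
    # Remove the component lying in the protected subspace: g - M (M^T g)
    protected = prev_basis @ (prev_basis.T @ grad)
    return -lr * (grad - protected)

# Toy check: protect the first coordinate direction e0; the resulting
# update then leaves the router weight along e0 untouched.
d = 4
basis = np.zeros((d, 1))
basis[0, 0] = 1.0                      # protected direction e0
g = np.array([3.0, 1.0, -2.0, 0.5])   # hypothetical router gradient
step = orthogonal_router_update(g, basis)
```

Under this projection, any component of the gradient that would alter routing learned on earlier tasks is discarded before the step is taken, which is one standard way (as in gradient-projection continual learning) to realize "updating only task-relevant directions."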
📝 Abstract
Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually expand their capabilities, making Multimodal Continual Instruction Tuning (MCIT) essential. Recent methods leverage sparse expert routing to promote task specialization, but we find that the expert routing process drifts as the data distribution evolves. For example, a grounding query that previously activated localization experts may instead be routed to irrelevant experts after learning OCR tasks. Meanwhile, the grounding-related experts can be overwritten by new tasks and lose their original functionality. This failure reflects two problems: router drift, where expert selection becomes inconsistent over time, and expert drift, where shared experts are overwritten across tasks. Therefore, we propose StAbilized Mixture-of-Experts (SAME) for MCIT. To address router drift, SAME stabilizes expert selection by decomposing routing dynamics into orthogonal subspaces and updating only task-relevant directions. To mitigate expert drift, we regulate expert updates via curvature-aware scaling using historical input covariance in a rehearsal-free manner. SAME also introduces adaptive expert activation to freeze selected experts during training, reducing redundant computation and cross-task interference. Extensive experiments demonstrate SAME's state-of-the-art performance.
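The curvature-aware scaling described above can be sketched as follows. In this illustrative NumPy example (the class name, the damping rule `ΔW' = ΔW · λ(λI + C/n)⁻¹`, and the parameter λ are assumptions for exposition, not the paper's exact formulation), each expert accumulates the covariance of inputs routed to it and uses that covariance as a curvature proxy: weight updates are damped along directions the expert has already seen often, while directions unused by earlier tasks update freely.

```python
import numpy as np

class CurvatureAwareExpert:
    """Sketch of a rehearsal-free, curvature-aware expert update: the
    running input covariance C acts as a curvature proxy, and raw
    updates are damped along well-covered input directions via
    dW_scaled = dW @ (lam * inv(lam * I + C / n))."""

    def __init__(self, d_in, lam=1.0):
        self.cov = np.zeros((d_in, d_in))  # historical input covariance
        self.lam = lam                     # damping strength (assumed)
        self.n = 0                         # number of inputs observed

    def observe(self, x):
        # Accumulate second moments of inputs routed to this expert;
        # no raw samples are stored, so the scheme stays rehearsal-free.
        self.cov += np.outer(x, x)
        self.n += 1

    def scale_update(self, delta_w):
        # Shrink update components along high-covariance (heavily used)
        # directions; rarely used directions pass through almost unchanged.
        d = self.cov.shape[0]
        precond = self.lam * np.linalg.inv(
            self.lam * np.eye(d) + self.cov / max(self.n, 1)
        )
        return delta_w @ precond

# Toy usage: the expert has only ever seen inputs along e0, so an update
# along e0 is damped much more strongly than an update along e1.
exp = CurvatureAwareExpert(d_in=2)
exp.observe(np.array([5.0, 0.0]))
scaled = exp.scale_update(np.array([[1.0, 1.0]]))
```

The design intuition matches the abstract: directions with large historical covariance correspond to functionality earlier tasks rely on, so suppressing updates there is what prevents shared experts from being overwritten.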