🤖 AI Summary
To address the severe degradation of language capability that occurs when multimodal Mixture-of-Experts (MoE) models integrate visual perception, this paper proposes a soft modality-aware routing mechanism that requires no architectural modifications and no extensive pure-text data. The core innovation is a KL-divergence-based modality-aware routing regularization that jointly optimizes expert modality specialization and language-capability preservation, enabling dynamic, differentiable expert assignment. Combined with visual instruction tuning, the method retains 86.6% of the original language performance while using only 2.5% of the original pure-text data, and it significantly outperforms baselines on multimodal understanding benchmarks. This approach relaxes the traditional fine-tuning paradigm's strong reliance on large-scale text corpora and, for the first time, achieves a unified balance between robust multimodal competence and high language fidelity under minimal text-data overhead.
📝 Abstract
Mixture of Experts (MoE) architectures have become a key approach for scaling large language models, with growing interest in extending them to multimodal tasks. Existing methods for building multimodal MoE models either incur high training costs or suffer from degraded language capabilities when adapting pretrained models. To address this, we propose Soft Modality-Aware Routing (SMAR), a novel regularization technique that uses Kullback-Leibler divergence to control routing probability distributions across modalities, encouraging expert specialization without modifying the model architecture or relying heavily on textual data. Experiments on visual instruction tuning show that SMAR preserves 86.6% of language ability with only 2.5% pure-text data, outperforming baselines while maintaining strong multimodal performance. Our approach offers a practical and efficient solution for balancing modality differentiation and language capability in multimodal MoE models.
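To make the routing regularizer concrete, below is a minimal PyTorch sketch of what a KL-divergence term over per-modality routing distributions could look like for a single MoE layer. This is a sketch under stated assumptions, not the paper's exact formulation: the use of a symmetric KL between the mean text and image routing distributions, and every name shown (`smar_regularizer`, `router_logits`, `is_image_token`, `lambda_smar`), are illustrative.

```python
# A minimal, illustrative sketch of a KL-based modality-aware routing
# regularizer for one MoE layer. Function and variable names
# (smar_regularizer, router_logits, is_image_token, lambda_smar) are
# hypothetical, not the paper's actual API.
import torch
import torch.nn.functional as F


def smar_regularizer(router_logits: torch.Tensor,
                     is_image_token: torch.Tensor,
                     eps: float = 1e-8) -> torch.Tensor:
    """Symmetric KL divergence between the mean routing distributions
    of text and image tokens.

    router_logits:  (num_tokens, num_experts) pre-softmax router scores
    is_image_token: (num_tokens,) boolean mask, True for image tokens;
                    the batch is assumed to contain both modalities.
    """
    probs = F.softmax(router_logits, dim=-1)                  # (tokens, experts)
    p_img = probs[is_image_token].mean(dim=0).clamp_min(eps)  # mean image dist.
    p_txt = probs[~is_image_token].mean(dim=0).clamp_min(eps) # mean text dist.
    # Symmetric KL: KL(img || txt) + KL(txt || img)
    return (p_img * (p_img / p_txt).log()).sum() \
         + (p_txt * (p_txt / p_img).log()).sum()


# One way to use it: reward modality separation with a small weight, so
# routing stays soft and differentiable rather than hard-partitioned.
# Whether the paper maximizes, bounds, or otherwise shapes this term is
# a detail of the full method.
# loss = task_loss - lambda_smar * smar_regularizer(logits, image_mask)
```

Because the term operates only on the router's softmax outputs, a regularizer of this shape adds no parameters and leaves the model architecture untouched, which is consistent with the abstract's claim.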