SMAR: Soft Modality-Aware Routing Strategy for MoE-based Multimodal Large Language Models Preserving Language Capabilities

📅 2025-06-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the severe language-capability degradation that occurs when visual perception is integrated into Mixture-of-Experts (MoE) multimodal large language models, this paper proposes a soft modality-aware routing mechanism that requires no architectural modifications and no extensive pure-text data. The core innovation is a KL-divergence-based modality-aware routing regularization that jointly optimizes expert modality specialization and language-capability preservation, enabling dynamic, differentiable expert assignment. Combined with visual instruction tuning, the method retains 86.6% of the original language performance while using only 2.5% of the original pure-text data, and significantly outperforms baselines on multimodal understanding benchmarks. The approach breaks the traditional fine-tuning paradigm's strong reliance on large-scale text corpora, achieving a unified balance between robust multimodal competence and high language fidelity under minimal text-data overhead.

📝 Abstract
Mixture of Experts (MoE) architectures have become a key approach for scaling large language models, with growing interest in extending them to multimodal tasks. Existing methods to build multimodal MoE models either incur high training costs or suffer from degraded language capabilities when adapting pretrained models. To address this, we propose Soft Modality-Aware Routing (SMAR), a novel regularization technique that uses Kullback-Leibler divergence to control routing probability distributions across modalities, encouraging expert specialization without modifying model architecture or heavily relying on textual data. Experiments on visual instruction tuning show that SMAR preserves language ability at 86.6% retention with only 2.5% pure text, outperforming baselines while maintaining strong multimodal performance. Our approach offers a practical and efficient solution to balance modality differentiation and language capabilities in multimodal MoE models.
Problem

Research questions and friction points this paper is trying to address.

Balancing modality differentiation in multimodal MoE models
Preserving language capabilities during multimodal adaptation
Reducing training costs for multimodal MoE architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Soft Modality-Aware Routing (SMAR) technique
KL divergence controls routing probabilities
Preserves language capabilities efficiently
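The paper itself does not publish its loss in this summary, so the following is only a minimal sketch of what a KL-divergence-based modality-aware routing regularizer could look like. All names (`smar_loss`, `router_logits`, `is_image`) are hypothetical; the idea is to compare the average expert-routing distributions of image tokens and text tokens and reward their separation, which encourages modality-specialized experts without changing the router architecture.

```python
import math

def softmax(logits):
    """Convert raw router gate scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def mean_dist(rows):
    """Average a list of per-token routing distributions expert-wise."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sym_kl(p, q, eps=1e-8):
    """Symmetric Kullback-Leibler divergence between two distributions."""
    kl_pq = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))
    return kl_pq + kl_qp

def smar_loss(router_logits, is_image):
    """Hypothetical modality-aware routing regularizer (not the paper's exact loss).

    router_logits: per-token lists of gate scores, one score per expert.
    is_image: per-token booleans, True for image tokens.
    Returns the negative symmetric KL between the modality-averaged routing
    distributions, so minimizing it pushes the modalities apart (specialization).
    """
    probs = [softmax(l) for l in router_logits]
    p_img = mean_dist([p for p, m in zip(probs, is_image) if m])
    p_txt = mean_dist([p for p, m in zip(probs, is_image) if not m])
    return -sym_kl(p_img, p_txt)
```

For example, a router that sends image tokens to expert 0 and text tokens to expert 1 yields a lower (more negative) loss than one that routes both modalities identically, which is the direction the regularizer rewards. In practice such a term would be added to the usual language-modeling and load-balancing losses with a small weight, keeping the routing "soft" rather than hard-partitioning experts by modality.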
👥 Authors
Guoyang Xia, Beijing University of Posts and Telecommunications
Yifeng Ding, University of Illinois at Urbana-Champaign (Software Engineering, Generative Models)
Fengfa Li, Li Auto
Lei Ren, Li Auto (NLP, LLM, VLM)
Chen Wei, Li Auto
Fangxiang Feng, Beijing University of Posts and Telecommunications (Multimodal Learning, Image Synthesis)
Xiaojie Wang, Beijing University of Posts and Telecommunications