Exploiting Mixture-of-Experts Redundancy Unlocks Multimodal Generative Abilities

📅 2025-03-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Efficiently extending pretrained large language models (LLMs) to multimodal generation remains challenging, particularly under strict parameter-budget constraints. Method: This paper proposes a framework that reuses the existing MoE architecture without adding parameters: (i) it treats the inherent expert redundancy in MoE-LLMs as spare capacity for multimodal learning; (ii) it introduces a cross-modal weight initialization strategy based on the Gromov-Wasserstein distance; and (iii) it leverages the modality-specific routing pathways that emerge naturally in the MoE router to enable image generation. Contribution/Results: The method activates redundant experts solely via LoRA fine-tuning, preserving the original language capabilities with negligible degradation (<0.3% performance loss) while substantially improving training stability and convergence speed. Extensive evaluation on mainstream MoE-LLMs, including Mixtral and DeepSpeed-MoE, demonstrates broad applicability, architectural scalability, and strong generalization across modalities.
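The modality-specific routing pathways mentioned in (iii) can be illustrated with a toy experiment. The sketch below is hypothetical (the gate weights and the shifted token distributions stand in for real hidden states): it runs text and image tokens through one shared top-1 router and measures how much the two modalities' expert-usage distributions overlap; low overlap indicates separate pathways through the experts.

```python
import numpy as np

rng = np.random.default_rng(1)
n_experts, d = 8, 32

# Hypothetical shared router gate (one linear layer over hidden states).
gate = rng.normal(size=(d, n_experts))

# Toy stand-ins for hidden states: image tokens drawn from a shifted
# distribution relative to text tokens.
text = rng.normal(loc=0.0, size=(200, d))
image = rng.normal(loc=1.5, size=(200, d))

def top1_expert(tokens):
    """Return the index of the highest-scoring expert per token."""
    return (tokens @ gate).argmax(axis=1)

# Fraction of each modality's tokens routed to each expert.
text_usage = np.bincount(top1_expert(text), minlength=n_experts) / len(text)
image_usage = np.bincount(top1_expert(image), minlength=n_experts) / len(image)

# Histogram intersection of the two usage distributions, in [0, 1]:
# values near 0 mean the modalities use disjoint experts.
overlap = np.minimum(text_usage, image_usage).sum()
```

Tracking this overlap during training is one simple way to observe the pathway separation the summary describes.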

📝 Abstract
In this work, we undertake the challenge of augmenting the existing generative capabilities of pre-trained text-only large language models (LLMs) with multi-modal generation capability while satisfying two core constraints: C1, preserving the original language generative capabilities with negligible performance degradation, and C2, learning the new modality within a small parameter budget, ensuring scalability and efficiency. In contrast to current approaches that add dedicated modules and thereby significantly increase the parameter count, we propose a method that leverages the underutilized capacity inherent in deep models. Specifically, we exploit the parameter redundancy within Mixture-of-Experts (MoEs) as a source of additional capacity for learning a new modality, enabling better parameter efficiency (C2). Moreover, we preserve the original language generation capabilities by applying low-rank adaptation exclusively to the tokens of the new modality (C1). Furthermore, we introduce a novel parameter initialization scheme based on the Gromov-Wasserstein distance to improve convergence and training stability. Through an extensive analysis of the routing mechanism, we uncover the emergence of modality-specific pathways and decreased redundancy within the experts, which together efficiently unlock multi-modal generative capabilities. Overall, our method can be seamlessly applied to a wide range of contemporary LLMs, providing a new pathway for transitioning from uni-modal to multi-modal architectures.
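The abstract's masking trick, applying low-rank adaptation only to new-modality tokens so text tokens pass through the frozen weights untouched, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the shapes, masks, and zero-initialized up-projection are assumptions chosen to make the preservation property visible.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank = 16, 4

W = rng.normal(size=(d_model, d_model))      # frozen pretrained projection
A = rng.normal(size=(d_model, rank)) * 0.01  # LoRA down-projection (trainable)
B_init = np.zeros((rank, d_model))           # LoRA up-projection, zero-initialized

def forward(x, image_mask, B):
    """Frozen path for every token; the LoRA update is added only where
    image_mask is True, so text-token outputs are bit-identical to the
    original model's outputs regardless of B."""
    base = x @ W
    delta = (x @ A) @ B
    return base + image_mask[:, None] * delta

tokens = rng.normal(size=(6, d_model))
image_mask = np.array([False, False, True, True, False, True])

out_init = forward(tokens, image_mask, B_init)                     # B == 0: identity update
out_tuned = forward(tokens, image_mask, rng.normal(size=(rank, d_model)))
```

With the zero-initialized `B` the model starts exactly at the pretrained function, and even after `B` is trained, only the masked (image) rows of the output change, which is precisely constraint C1.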
Problem

Research questions and friction points this paper is trying to address.

Augment text-only LLMs with multimodal generation capability
Preserve original language capabilities with minimal performance loss
Achieve multimodal learning under small parameter budget
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes Mixture-of-Experts redundancy for multimodal learning
Applies low-rank adaptation to new modality tokens
Introduces Gromov-Wasserstein-based initialization for training stability
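The Gromov-Wasserstein idea in the last bullet compares the intra-space distance structures of two embedding sets rather than their (incomparable) raw coordinates. The sketch below is a crude stand-in for that matching, not the paper's algorithm: it describes each point by its sorted distance profile and solves a linear assignment between profiles, then initializes hypothetical new-modality rows from their matched pretrained text rows.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(2)
n, d = 32, 16

text_emb = rng.normal(size=(n, d))     # pretrained text-token embeddings
image_proto = rng.normal(size=(n, d))  # stand-in features for new image tokens

# Intra-space pairwise distance matrices: Gromov-Wasserstein compares
# these relational structures across the two spaces.
D_text = cdist(text_emb, text_emb)
D_img = cdist(image_proto, image_proto)

# Crude proxy for GW matching: each point's sorted distance profile is a
# permutation-invariant signature of its position in its own space.
prof_text = np.sort(D_text, axis=1)
prof_img = np.sort(D_img, axis=1)

# Assign each image prototype to the text row with the most similar profile.
cost = cdist(prof_img, prof_text)
row, col = linear_sum_assignment(cost)

# Initialize each new image token's parameters from its matched text row.
image_init = text_emb[col]
```

A faithful implementation would solve the actual Gromov-Wasserstein optimal-transport problem (e.g. via an entropic solver) instead of this profile-matching shortcut, but the structure-before-coordinates principle is the same.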