🤖 AI Summary
Efficiently extending pretrained large language models (LLMs) to multimodal generation remains challenging, particularly under strict parameter budgets. Method: The paper proposes a framework that reuses the existing MoE architecture instead of adding dedicated modules: (i) it treats the inherent expert redundancy in MoE-LLMs as spare capacity for learning a new modality; (ii) it introduces a cross-modal weight initialization strategy grounded in the Gromov–Wasserstein distance; and (iii) it leverages the modality-specific routing pathways that emerge naturally in the MoE router to enable image generation. Contribution/Results: The method activates redundant experts solely via LoRA fine-tuning, preserving the original language capabilities with negligible degradation (<0.3% performance loss) while substantially improving training stability and convergence speed. Extensive evaluation on mainstream MoE-LLMs, including Mixtral and DeepSpeed-MoE, demonstrates broad applicability, architectural scalability, and strong generalization across modalities.
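To make the "LoRA only on new-modality tokens" idea concrete, here is a minimal sketch of how such a mechanism could look in PyTorch. It is an illustration under stated assumptions, not the paper's released code; the class name `ModalityLoRAExpert`, the rank, and the modality mask are all hypothetical.

```python
# Illustrative sketch (not the paper's code): a frozen pretrained MoE expert
# receives a low-rank (LoRA) correction only on tokens flagged as belonging
# to the new modality, so text tokens pass through the original weights unchanged.
import torch
import torch.nn as nn

class ModalityLoRAExpert(nn.Module):
    def __init__(self, expert: nn.Linear, rank: int = 8):
        super().__init__()
        self.expert = expert
        for p in self.expert.parameters():
            p.requires_grad = False                 # keep original language weights intact
        d_in, d_out = expert.in_features, expert.out_features
        self.lora_a = nn.Linear(d_in, rank, bias=False)   # small added parameter budget
        self.lora_b = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.lora_b.weight)          # start as a zero (identity) update

    def forward(self, x: torch.Tensor, is_new_modality: torch.Tensor) -> torch.Tensor:
        base = self.expert(x)
        delta = self.lora_b(self.lora_a(x))
        # Apply the low-rank correction only where the token is from the new modality.
        return base + delta * is_new_modality.unsqueeze(-1).to(base.dtype)

# Toy usage: last 8 positions of each sequence are treated as image tokens.
expert = nn.Linear(1024, 1024)
wrapped = ModalityLoRAExpert(expert, rank=8)
x = torch.randn(4, 16, 1024)
mask = torch.zeros(4, 16, dtype=torch.bool)
mask[:, 8:] = True
y = wrapped(x, mask)
```

Because the correction is gated by the modality mask, text tokens see exactly the original expert, which is one way the language-preservation constraint can be enforced mechanically.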
📝 Abstract
In this work, we undertake the challenge of augmenting the existing generative capabilities of pre-trained text-only large language models (LLMs) with multi-modal generation capability while satisfying two core constraints: C1 preserving the original language generative capabilities with negligible performance degradation, and C2 adhering to a small parameter budget for learning the new modality, ensuring scalability and efficiency. In contrast to current approaches that add dedicated modules, thereby significantly increasing the parameter count, we propose a method that leverages the underutilized capacity inherent in deep models. Specifically, we exploit the parameter redundancy within Mixture-of-Experts (MoEs) as a source of additional capacity for learning a new modality, enabling better parameter efficiency (C2). Moreover, we preserve the original language generation capabilities by applying low-rank adaptation exclusively to the tokens of the new modality (C1). Furthermore, we introduce a novel parameter initialization scheme based on the Gromov-Wasserstein distance to improve convergence and training stability. Through an extensive analysis of the routing mechanism, we uncover the emergence of modality-specific pathways, along with decreased redundancy within the experts, which can efficiently unlock multi-modal generative capabilities. Overall, our method can be seamlessly applied to a wide range of contemporary LLMs, providing a new pathway for transitioning from uni-modal to multi-modal architectures.
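The Gromov-Wasserstein-based initialization can be made more tangible with a small sketch. The snippet below, using the POT (Python Optimal Transport) library, computes a GW coupling and distance between the row geometries of two expert weight matrices and picks the closest pretrained expert as a donor for initializing the new-modality pathway. This is one plausible reading of the idea under our own assumptions (Euclidean cost matrices, uniform row weights, toy data); the paper's actual initialization scheme may differ.

```python
# Illustrative sketch, not the paper's method: Gromov-Wasserstein alignment
# between expert weight-row geometries, used to select a donor expert.
import numpy as np
import ot  # POT: Python Optimal Transport

def gw_alignment(W_src: np.ndarray, W_tgt: np.ndarray):
    """Return the GW coupling and distance between the row geometries of two
    expert weight matrices (each row treated as a point in its own metric space)."""
    C1 = ot.dist(W_src, W_src)                     # intra-expert pairwise distances
    C2 = ot.dist(W_tgt, W_tgt)
    p = ot.unif(W_src.shape[0])                    # uniform weights over rows
    q = ot.unif(W_tgt.shape[0])
    T, log = ot.gromov.gromov_wasserstein(
        C1, C2, p, q, loss_fun='square_loss', log=True)
    return T, log['gw_dist']

# Toy example: choose the pretrained expert whose geometry is closest to a
# reference expert as a candidate source of weights for the new modality.
experts = [np.random.randn(64, 32) for _ in range(4)]   # stand-ins for expert weights
ref = np.random.randn(64, 32)
distances = [gw_alignment(ref, W)[1] for W in experts]
donor_idx = int(np.argmin(distances))
```

The appeal of a GW-style criterion here is that it compares the internal geometry of two weight matrices without requiring them to live in the same coordinate system, which is a natural fit when matching experts across modalities.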