Symphony-MoE: Harmonizing Disparate Pre-trained Models into a Coherent Mixture-of-Experts

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance degradation observed when Mixture-of-Experts (MoE) architectures are constructed directly from heterogeneous pre-trained models of identical architecture (e.g., Llama2-Chat and Code Llama), whose parameters occupy misaligned regions of parameter space. The authors propose Symphony-MoE, a two-stage "harmonious fusion" framework that requires no retraining of the source models. Stage 1 is training-free: a layer-aware fusion strategy constructs a shared backbone, and activation-based functional alignment reconciles the experts' parameter spaces. Stage 2 trains only a lightweight router to coordinate expert specialization. To the authors' knowledge, this is the first method enabling plug-and-play MoE integration across pre-trained models with distinct training objectives yet identical architectures. Experiments demonstrate substantial gains over baselines on diverse downstream tasks and in out-of-distribution generalization, while improving expert diversity and model robustness.

📝 Abstract
Mixture-of-Experts (MoE) models enable scalable performance by activating large parameter sets sparsely, minimizing computational overhead. To circumvent the prohibitive cost of training MoEs from scratch, recent work employs upcycling, reusing a single pre-trained dense model by replicating its feed-forward network (FFN) layers into experts. However, this limits expert diversity, as all experts originate from a single pre-trained dense model. This paper addresses this limitation by constructing powerful MoE models using experts sourced from multiple identically-architected but disparate pre-trained models (e.g., Llama2-Chat and Code Llama). A key challenge lies in the fact that these source models occupy disparate, dissonant regions of the parameter space, making direct upcycling prone to severe performance degradation. To overcome this, we propose Symphony-MoE, a novel two-stage framework designed to harmonize these models into a single, coherent expert mixture. First, we establish this harmony in a training-free manner: we construct a shared backbone via a layer-aware fusion strategy and, crucially, alleviate parameter misalignment among experts using activation-based functional alignment. Subsequently, a single lightweight stage of router training coordinates the entire architecture. Experiments demonstrate that our method successfully integrates experts from heterogeneous sources, achieving an MoE model that significantly surpasses baselines in multi-domain tasks and out-of-distribution generalization.
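The activation-based functional alignment described in the abstract can be illustrated with a toy sketch (assumed details, not the paper's released code): hidden units of one FFN "expert" are matched to another's by the correlation of their activations on shared calibration inputs, and the matched permutation is applied jointly to the input and output weights, which leaves the expert's function unchanged. The greedy matching rule and all shapes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 8, 16, 256

# Two FFN experts with identical architecture. Expert B is a hidden-unit
# permutation of expert A, so a correct alignment is recoverable.
W_in_a, W_out_a = rng.normal(size=(d_ff, d_model)), rng.normal(size=(d_model, d_ff))
perm_true = rng.permutation(d_ff)
W_in_b, W_out_b = W_in_a[perm_true], W_out_a[:, perm_true]

X = rng.normal(size=(n_tokens, d_model))       # shared calibration inputs
H_a = np.maximum(X @ W_in_a.T, 0.0)            # hidden activations, expert A
H_b = np.maximum(X @ W_in_b.T, 0.0)            # hidden activations, expert B

# Correlate hidden units across experts, then greedily match each A-unit
# to its most-correlated unused B-unit (most confident rows first).
C = H_a.T @ H_b                                # (d_ff, d_ff) similarity matrix
perm = np.full(d_ff, -1)
used = set()
for i in np.argsort(-C.max(axis=1)):
    j = max((j for j in range(d_ff) if j not in used), key=lambda j: C[i, j])
    perm[i] = j
    used.add(j)

# Permute B's weights into A's hidden-unit order. Applying the same
# permutation to W_in rows and W_out columns preserves B's function exactly.
W_in_b_al, W_out_b_al = W_in_b[perm], W_out_b[:, perm]
out_before = np.maximum(X @ W_in_b.T, 0.0) @ W_out_b.T
out_after = np.maximum(X @ W_in_b_al.T, 0.0) @ W_out_b_al.T
print(np.allclose(out_before, out_after))      # → True (function-preserving)
```

A greedy match stands in here for a proper assignment solver; the point is only that alignment reorders parameters without changing what each expert computes, so aligned experts can share one backbone.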
Problem

Research questions and friction points this paper is trying to address.

Limited expert diversity when MoE experts are upcycled from a single pre-trained dense model
Parameter-space misalignment between disparate pre-trained models of identical architecture
Severe performance degradation when heterogeneous expert sources are merged directly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free harmonization of disparate pre-trained models into a single MoE
Layer-aware backbone fusion combined with activation-based functional alignment of experts
A single lightweight router-training stage for multi-domain coordination and generalization
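The lightweight router-training stage can be sketched as follows (a toy sketch under assumed details, not the paper's implementation): the experts are frozen, and only a linear softmax gate is trained so the mixture routes each input to the expert that handles it best.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 512
X = rng.normal(size=(n, d))

# Two frozen "experts": each is only accurate on its half of the input space.
expert_a = lambda x: 2.0 * x.sum(axis=1)       # correct when x[:, 0] > 0
expert_b = lambda x: -2.0 * x.sum(axis=1)      # correct when x[:, 0] <= 0
y = np.where(X[:, 0] > 0, expert_a(X), expert_b(X))

w = np.zeros((d, 2))                           # router weights: the only trainables
lr = 0.1
for _ in range(500):
    logits = X @ w
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)          # softmax gate over the two experts
    outs = np.stack([expert_a(X), expert_b(X)], axis=1)
    pred = (p * outs).sum(axis=1)              # soft mixture of frozen experts
    err = pred - y
    # Gradient of 0.5 * mean squared error w.r.t. the gate logits,
    # via the softmax-mixture chain rule: p_j * (o_j - pred) * err.
    g = p * (outs - pred[:, None]) * err[:, None]
    w -= lr * (X.T @ g) / n

# Fraction of inputs routed (top-1) to the expert that is actually correct.
acc = np.mean((p.argmax(axis=1) == 0) == (X[:, 0] > 0))
print(f"router picks the right expert on {acc:.0%} of inputs")
```

Because only the small gate matrix receives gradients, this stage is cheap relative to expert training, which matches the paper's claim that a single lightweight router-training pass suffices once the experts have been harmonized.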
👥 Authors
Qi Wang — State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences
Hanyang Peng — Peng Cheng Laboratory (Deep Learning · Optimization)
Yue Yu — Peng Cheng Laboratory