Symphony-MoE: Harmonizing Disparate Pre-trained Models into a Coherent Mixture-of-Experts

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance degradation observed when Mixture-of-Experts (MoE) architectures are constructed directly from heterogeneous pre-trained models of identical architecture (e.g., Llama2-Chat and Code Llama), whose parameters occupy misaligned regions of parameter space. The authors propose Symphony-MoE, a two-stage "harmonious fusion" framework that requires no retraining of the source models. Stage 1 is training-free: a layer-aware fusion strategy constructs a shared backbone, and activation-based functional alignment reconciles the experts' parameter spaces. Stage 2 trains only a lightweight router to coordinate expert specialization. To the authors' knowledge, this is the first method enabling plug-and-play MoE integration across pre-trained models with distinct training objectives yet identical architectures. Experiments demonstrate substantial gains over baselines on diverse downstream tasks and in out-of-distribution generalization, while improving expert diversity and model robustness.

📝 Abstract
Mixture-of-Experts (MoE) models enable scalable performance by activating large parameter sets sparsely, minimizing computational overhead. To circumvent the prohibitive cost of training MoEs from scratch, recent work employs upcycling, reusing a single pre-trained dense model by replicating its feed-forward network (FFN) layers into experts. However, this limits expert diversity, as all experts originate from a single pre-trained dense model. This paper addresses this limitation by constructing powerful MoE models using experts sourced from multiple identically-architected but disparate pre-trained models (e.g., Llama2-Chat and Code Llama). A key challenge lies in the fact that these source models occupy disparate, dissonant regions of the parameter space, making direct upcycling prone to severe performance degradation. To overcome this, we propose Symphony-MoE, a novel two-stage framework designed to harmonize these models into a single, coherent expert mixture. First, we establish this harmony in a training-free manner: we construct a shared backbone via a layer-aware fusion strategy and, crucially, alleviate parameter misalignment among experts using activation-based functional alignment. Subsequently, a single lightweight stage of router training coordinates the entire architecture. Experiments demonstrate that our method successfully integrates experts from heterogeneous sources, achieving an MoE model that significantly surpasses baselines in multi-domain tasks and out-of-distribution generalization.
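The activation-based functional alignment described in the abstract can be illustrated with a toy sketch (assumed details, not the paper's released code): hidden units of one FFN "expert" are matched to another's by the correlation of their activations on shared calibration inputs, and the matched permutation is applied jointly to the input and output weights, which leaves the expert's function unchanged. The greedy matching rule and all shapes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 8, 16, 256

# Two FFN experts with identical architecture. Expert B is a hidden-unit
# permutation of expert A, so a correct alignment is recoverable.
W_in_a, W_out_a = rng.normal(size=(d_ff, d_model)), rng.normal(size=(d_model, d_ff))
perm_true = rng.permutation(d_ff)
W_in_b, W_out_b = W_in_a[perm_true], W_out_a[:, perm_true]

X = rng.normal(size=(n_tokens, d_model))       # shared calibration inputs
H_a = np.maximum(X @ W_in_a.T, 0.0)            # hidden activations, expert A
H_b = np.maximum(X @ W_in_b.T, 0.0)            # hidden activations, expert B

# Correlate hidden units across experts, then greedily match each A-unit
# to its most-correlated unused B-unit (most confident rows first).
C = H_a.T @ H_b                                # (d_ff, d_ff) similarity matrix
perm = np.full(d_ff, -1)
used = set()
for i in np.argsort(-C.max(axis=1)):
    j = max((j for j in range(d_ff) if j not in used), key=lambda j: C[i, j])
    perm[i] = j
    used.add(j)

# Permute B's weights into A's hidden-unit order. Applying the same
# permutation to W_in rows and W_out columns preserves B's function exactly.
W_in_b_al, W_out_b_al = W_in_b[perm], W_out_b[:, perm]
out_before = np.maximum(X @ W_in_b.T, 0.0) @ W_out_b.T
out_after = np.maximum(X @ W_in_b_al.T, 0.0) @ W_out_b_al.T
print(np.allclose(out_before, out_after))      # → True (function-preserving)
```

A greedy match stands in here for a proper assignment solver; the point is only that alignment reorders parameters without changing what each expert computes, so aligned experts can share one backbone.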
Problem

Research questions and friction points this paper is trying to address.

Limited expert diversity when MoE experts are upcycled from a single pre-trained dense model
Parameter-space misalignment between disparate pre-trained models of identical architecture
Severe performance degradation when heterogeneous expert sources are merged directly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free harmonization of disparate pre-trained models into a single MoE
Layer-aware backbone fusion combined with activation-based functional alignment of experts
A single lightweight router-training stage for multi-domain coordination and generalization
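The lightweight router-training stage can be sketched as follows (a toy sketch under assumed details, not the paper's implementation): the experts are frozen, and only a linear softmax gate is trained so the mixture routes each input to the expert that handles it best.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 512
X = rng.normal(size=(n, d))

# Two frozen "experts": each is only accurate on its half of the input space.
expert_a = lambda x: 2.0 * x.sum(axis=1)       # correct when x[:, 0] > 0
expert_b = lambda x: -2.0 * x.sum(axis=1)      # correct when x[:, 0] <= 0
y = np.where(X[:, 0] > 0, expert_a(X), expert_b(X))

w = np.zeros((d, 2))                           # router weights: the only trainables
lr = 0.1
for _ in range(500):
    logits = X @ w
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)          # softmax gate over the two experts
    outs = np.stack([expert_a(X), expert_b(X)], axis=1)
    pred = (p * outs).sum(axis=1)              # soft mixture of frozen experts
    err = pred - y
    # Gradient of 0.5 * mean squared error w.r.t. the gate logits,
    # via the softmax-mixture chain rule: p_j * (o_j - pred) * err.
    g = p * (outs - pred[:, None]) * err[:, None]
    w -= lr * (X.T @ g) / n

# Fraction of inputs routed (top-1) to the expert that is actually correct.
acc = np.mean((p.argmax(axis=1) == 0) == (X[:, 0] > 0))
print(f"router picks the right expert on {acc:.0%} of inputs")
```

Because only the small gate matrix receives gradients, this stage is cheap relative to expert training, which matches the paper's claim that a single lightweight router-training pass suffices once the experts have been harmonized.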
👥 Authors
Qi Wang — State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences
Hanyang Peng — Peng Cheng Laboratory (Deep Learning · Optimization)
Yue Yu — Peng Cheng Laboratory