On Linear Mode Connectivity of Mixture-of-Experts Architectures

📅 2025-09-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Linear mode connectivity (LMC), the existence of low-loss linear paths in parameter space between independently trained models, has not been studied in Mixture-of-Experts (MoE) architectures, where permutation symmetries of the experts and the gating module prevent direct parameter alignment. Method: We propose a matching algorithm that jointly aligns experts and gating parameters, resolving these permutation ambiguities so that independently trained models can be compared in a common parameterization. Contribution/Results: This work is the first to systematically establish LMC in MoE models across dense, sparse, and shared-expert variants, on datasets of varying scales and modalities. The experiments confirm that LMC holds across these settings and report gains in model ensembling (+1.8% average accuracy). The findings also offer insight into the geometry of MoE loss landscapes, optimization dynamics, and generalization.
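
To make "low-loss linear path" concrete, one common way to formalize LMC in the literature is through a loss barrier along the interpolation path. The notation below is an assumption for illustration ($\theta_A$, $\theta_B$ for the two trained parameter vectors, $\pi$ for the aligning permutation, $\mathcal{L}$ for the loss), and the paper's exact definition may differ:

```latex
% Loss barrier between model A and the permutation-aligned model B.
% Applying \pi does not change the function, so
% \mathcal{L}(\pi(\theta_B)) = \mathcal{L}(\theta_B).
B(\theta_A, \theta_B)
  = \sup_{t \in [0,1]}
    \Big[
      \mathcal{L}\big((1-t)\,\theta_A + t\,\pi(\theta_B)\big)
      - \big((1-t)\,\mathcal{L}(\theta_A) + t\,\mathcal{L}(\theta_B)\big)
    \Big]
```

Under this formalization, the two models are said to be linearly mode connected (up to permutation) when $B(\theta_A, \theta_B)$ is approximately zero.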

📝 Abstract

Linear Mode Connectivity (LMC) is a notable phenomenon in the loss landscapes of neural networks, wherein independently trained models have been observed to be connected--up to permutation symmetries--by linear paths in parameter space along which the loss remains consistently low. This observation challenges classical views of non-convex optimization and has implications for model ensembling, generalization, and our understanding of neural loss geometry. Inspired by recent studies on LMC in standard neural networks, we systematically investigate this phenomenon within Mixture-of-Experts (MoE) architectures--a class of models known for their scalability and computational efficiency, which combine traditional neural networks--referred to as experts--through a learnable gating mechanism. We begin by conducting a comprehensive analysis of both dense and sparse gating regimes, demonstrating that the symmetries inherent to MoE architectures are fully characterized by permutations acting on both the expert components and the gating function. Building on these foundational findings, we propose a matching algorithm that enables alignment between independently trained MoEs, thereby facilitating the discovery of LMC. Finally, we empirically validate the presence of LMC using our proposed algorithm across diverse MoE configurations--including dense, sparse, and shared-expert variants--under a wide range of model settings and datasets of varying scales and modalities. Our results confirm the existence of LMC in MoE architectures and offer fundamental insights into the functional landscape and optimization dynamics of deep learning models.
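
The permutation symmetry described in the abstract can be checked numerically on a toy model. The following is a minimal sketch, assuming a dense softmax-gated MoE with linear experts; the function names, shapes, and the fixed permutation are illustrative choices, not taken from the paper. It shows that permuting the experts together with the matching gating rows leaves the output unchanged, while permuting the experts alone does not.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, gate_w, gate_b, expert_w, expert_b):
    """Dense softmax-gated MoE with linear experts (toy assumption).

    x: (batch, d_in), gate_w: (n_experts, d_in), gate_b: (n_experts,)
    expert_w: (n_experts, d_out, d_in), expert_b: (n_experts, d_out)
    """
    gates = softmax(x @ gate_w.T + gate_b)                   # (batch, n_experts)
    outs = np.einsum("eoi,bi->beo", expert_w, x) + expert_b  # (batch, n_experts, d_out)
    return np.einsum("be,beo->bo", gates, outs)              # gate-weighted sum of experts

rng = np.random.default_rng(0)
n_experts, d_in, d_out = 4, 5, 3
gate_w = rng.normal(size=(n_experts, d_in))
gate_b = rng.normal(size=n_experts)
expert_w = rng.normal(size=(n_experts, d_out, d_in))
expert_b = rng.normal(size=(n_experts, d_out))
x = rng.normal(size=(8, d_in))

perm = np.array([2, 0, 3, 1])  # an arbitrary non-identity reordering of the experts

y = moe_forward(x, gate_w, gate_b, expert_w, expert_b)

# Permute experts AND the corresponding gating rows: same function.
y_joint = moe_forward(x, gate_w[perm], gate_b[perm], expert_w[perm], expert_b[perm])

# Permute experts only: gates now weight the wrong experts, so the function changes.
y_experts_only = moe_forward(x, gate_w, gate_b, expert_w[perm], expert_b[perm])

joint_diff = float(np.abs(y - y_joint).max())
experts_only_diff = float(np.abs(y - y_experts_only).max())
print(f"joint permutation diff:   {joint_diff:.2e}")
print(f"experts-only permutation: {experts_only_diff:.2e}")
```

This is why two independently trained MoEs can implement similar functions while their raw parameters look unrelated: the expert ordering is arbitrary, so a direct parameter-wise comparison or average is only meaningful after the experts and gating rows have been aligned.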

Problem

Research questions and friction points this paper is trying to address.

Investigating Linear Mode Connectivity in Mixture-of-Experts architectures

Characterizing permutation symmetries in MoE expert components and gating

Developing a matching algorithm to enable connectivity between independently trained MoEs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Matching algorithm aligns independently trained MoEs

Analyzes the symmetries of dense and sparse gating regimes

Empirically validates LMC across diverse MoE configurations
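
The align-then-interpolate idea behind these points can be sketched on a toy problem. This is not the paper's algorithm: it assumes simple weight matching (pick the expert permutation that minimizes squared distance between per-expert parameters, gating rows included), uses brute-force search over permutations, and builds model B as a permuted, slightly perturbed copy of model A. All names (`match_experts`, `barrier`, `hidden_perm`) are hypothetical.

```python
import itertools
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, p):
    # Dense softmax-gated MoE with linear experts (toy assumption).
    gates = softmax(x @ p["gate_w"].T + p["gate_b"])
    outs = np.einsum("eoi,bi->beo", p["expert_w"], x) + p["expert_b"]
    return np.einsum("be,beo->bo", gates, outs)

def flatten_per_expert(p):
    # One row per expert: its gating row, gating bias, and expert parameters.
    n = p["gate_w"].shape[0]
    return np.concatenate(
        [p["gate_w"], p["gate_b"][:, None], p["expert_w"].reshape(n, -1), p["expert_b"]],
        axis=1,
    )

def match_experts(p_a, p_b):
    # cost[i, j]: squared distance between A's expert i and B's expert j.
    fa, fb = flatten_per_expert(p_a), flatten_per_expert(p_b)
    cost = ((fa[:, None, :] - fb[None, :, :]) ** 2).sum(-1)
    n = cost.shape[0]
    # Brute force is fine for a handful of experts; it is factorial in n.
    best = min(
        itertools.permutations(range(n)),
        key=lambda s: sum(cost[i, s[i]] for i in range(n)),
    )
    return np.array(best)

def interpolate(p_a, p_b, t):
    return {k: (1 - t) * p_a[k] + t * p_b[k] for k in p_a}

def barrier(p_a, p_b, x, y, steps=11):
    # Largest excess of the path loss over the linear interpolation of endpoint losses.
    ts = np.linspace(0.0, 1.0, steps)
    losses = np.array(
        [np.mean((moe_forward(x, interpolate(p_a, p_b, t)) - y) ** 2) for t in ts]
    )
    return float(np.max(losses - ((1 - ts) * losses[0] + ts * losses[-1])))

rng = np.random.default_rng(0)
n, d_in, d_out = 4, 5, 3
p_a = {
    "gate_w": rng.normal(size=(n, d_in)),
    "gate_b": rng.normal(size=n),
    "expert_w": rng.normal(size=(n, d_out, d_in)),
    "expert_b": rng.normal(size=(n, d_out)),
}
# Model B: same function as A up to a hidden expert reordering plus small noise.
hidden_perm = np.array([2, 0, 3, 1])
p_b = {k: v[hidden_perm] + 0.01 * rng.normal(size=v.shape) for k, v in p_a.items()}

x = rng.normal(size=(256, d_in))
y = moe_forward(x, p_a)  # toy targets: model A acts as the teacher

perm = match_experts(p_a, p_b)
p_b_aligned = {k: v[perm] for k, v in p_b.items()}  # reorder experts and gating rows together

naive_barrier = barrier(p_a, p_b, x, y)
aligned_barrier = barrier(p_a, p_b_aligned, x, y)
print(f"barrier without alignment: {naive_barrier:.4f}")
print(f"barrier after alignment:   {aligned_barrier:.4f}")
```

For realistic expert counts the brute-force search would be replaced by an assignment solver such as `scipy.optimize.linear_sum_assignment` applied to the same cost matrix. Sparse (top-k) gating and shared experts, which the paper also covers, are outside the scope of this sketch.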

🔎 Similar Papers

On Expert Estimation in Hierarchical Mixture of Experts: Beyond Softmax Gating Functions