ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization

๐Ÿ“… 2025-11-14
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Mixture-of-Experts (MoE) models suffer from routing instability, low expert utilization, and load imbalance, and the standard remedy of auxiliary load-balancing losses often impairs expert specialization. This paper proposes ERMoE, which reparameterizes each expert in a learned orthonormal feature basis and routes by the cosine similarity between input features and those bases. This feature-basis-driven gating mechanism removes the need for explicit load-balancing losses, jointly providing routing stability, gradient purity, and emergent load balancing. Within the Transformer framework, ERMoE achieves feature-space alignment and interpretable expert specialization. Experiments demonstrate state-of-the-art performance on ImageNet classification and cross-modal retrieval. In 3D MRI-based brain age prediction, ERMoE reduces mean absolute error by over 7%, and expert assignments exhibit anatomically interpretable functional specialization.

๐Ÿ“ Abstract
Mixture-of-Experts (MoE) architectures expand model capacity by sparsely activating experts but face two core challenges: misalignment between router logits and each expert's internal structure leads to unstable routing and expert underutilization, and load imbalances create straggler bottlenecks. Standard solutions, such as auxiliary load-balancing losses, can reduce load disparities but often weaken expert specialization and hurt downstream performance. To address these issues, we propose ERMoE, a sparse MoE transformer that reparameterizes each expert in a learned orthonormal eigenbasis and replaces learned gating logits with an "Eigenbasis Score", defined as the cosine similarity between input features and an expert's basis. This content-aware routing ties token assignments directly to experts' representation spaces, stabilizing utilization and promoting interpretable specialization without sacrificing sparsity. Crucially, ERMoE removes the need for explicit balancing losses and avoids the interfering gradients they introduce. We show that ERMoE achieves state-of-the-art accuracy on ImageNet classification and cross-modal image-text retrieval benchmarks (e.g., COCO, Flickr30K), while naturally producing flatter expert load distributions. Moreover, a 3D MRI variant (ERMoE-ba) improves brain age prediction accuracy by more than 7% and yields anatomically interpretable expert specializations. ERMoE thus introduces a new architectural principle for sparse expert models that directly addresses routing instabilities and enables improved performance with scalable, interpretable specialization.
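As a rough illustration of the routing idea (not the authors' implementation; the exact score definition, shapes, and top-k policy here are assumptions), the "Eigenbasis Score" can be read as the norm of a unit-normalized token's projection onto each expert's orthonormal subspace, i.e., its cosine similarity with that subspace:

```python
import torch
import torch.nn.functional as F

def eigenbasis_scores(x, bases):
    """Score each token against each expert's orthonormal basis.

    x:     (tokens, d)      input features
    bases: (experts, d, r)  per-expert basis with orthonormal columns
    Returns (tokens, experts) scores in [0, 1].
    """
    x_unit = F.normalize(x, dim=-1)                    # unit-norm tokens
    # Coordinates of each token in each expert's basis; the norm of these
    # coordinates is the cosine between the token and its projection.
    proj = torch.einsum("td,edr->ter", x_unit, bases)
    return proj.norm(dim=-1)

# Top-k routing driven purely by the scores, with no auxiliary
# load-balancing loss.
x = torch.randn(8, 16)
bases, _ = torch.linalg.qr(torch.randn(4, 16, 4))      # orthonormal columns
scores = eigenbasis_scores(x, bases)                   # (8, 4)
topk = scores.topk(k=2, dim=-1).indices                # expert assignments
```

Because the score depends only on where a token lies relative to each expert's subspace, tokens with similar features are drawn to the same expert, which is the content-aware behavior the abstract describes.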
Problem

Research questions and friction points this paper is trying to address.

Unstable routing and expert underutilization in sparse MoE models
Reliance on explicit load-balancing losses, whose gradients interfere with specialization
Achieving interpretable expert specialization without sacrificing sparsity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reparameterizes experts in learned orthonormal eigenbasis
Uses cosine similarity for content-aware routing
Eliminates need for explicit load-balancing losses
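The first innovation above, keeping each expert's basis orthonormal while it is learned, can be sketched with a QR reparameterization of an unconstrained parameter. This is one common way to maintain orthonormality during training; the paper's actual scheme, class names, and dimensions here are assumptions:

```python
import torch

class EigenExpert(torch.nn.Module):
    """Hypothetical sketch of an expert whose weights are factored through
    an orthonormal basis U, so routing scores and the expert's computation
    refer to the same subspace."""

    def __init__(self, d, r):
        super().__init__()
        self.raw = torch.nn.Parameter(torch.randn(d, r))    # unconstrained
        self.coeff = torch.nn.Parameter(torch.randn(r, d))  # mixing weights

    def basis(self):
        # QR of the raw parameter yields orthonormal columns; gradients
        # flow through the factorization back to `raw`.
        q, _ = torch.linalg.qr(self.raw)
        return q

    def forward(self, x):
        u = self.basis()                # (d, r), orthonormal columns
        return (x @ u) @ self.coeff     # project onto basis, then mix
```

Tying the gate to `basis()` rather than to separate router logits is what removes the misalignment between routing and each expert's internal structure.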
๐Ÿ”Ž Similar Papers
No similar papers found.