Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

📅 2025-12-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In Mixture-of-Experts (MoE) models, misalignment between router decisions and expert capabilities hinders performance. This work proposes Expert–Router Coupling (ERC), an auxiliary loss that enforces a dual constraint: (1) each expert exhibits maximal activation on its dedicated proxy token; and (2) that proxy token achieves peak activation exclusively in its assigned expert. ERC is implemented via proxy-token modeling and perturbed embedding injection, introducing a fixed O(n²) computational overhead (where n is the number of experts) that is independent of batch size. Evaluated on pretraining of MoE-LLMs (3B–15B parameters) at trillion-token scale, ERC significantly improves downstream task performance. Moreover, it enables fine-grained, quantifiable monitoring and dynamic adjustment of expert specialization, thereby establishing a new paradigm for interpretable and controllable MoE training.

📝 Abstract
Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance. To address this, we propose the expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router's decisions with expert capabilities. Our approach treats each expert's router embedding as a proxy token for the tokens assigned to that expert, and feeds perturbed router embeddings through the experts to obtain internal activations. The ERC loss enforces two constraints on these activations: (1) Each expert must exhibit higher activation for its own proxy token than for the proxy tokens of any other expert. (2) Each proxy token must elicit stronger activation from its corresponding expert than from any other expert. These constraints jointly ensure that each router embedding faithfully represents its corresponding expert's capability, while each expert specializes in processing the tokens actually routed to it. The ERC loss is computationally efficient, operating only on n^2 activations, where n is the number of experts. This represents a fixed cost independent of batch size, unlike prior coupling methods that scale with the number of tokens (often millions per batch). Through pre-training MoE-LLMs ranging from 3B to 15B parameters and extensive analysis on trillions of tokens, we demonstrate the effectiveness of the ERC loss. Moreover, the ERC loss offers flexible control and quantitative tracking of expert specialization levels during training, providing valuable insights into MoEs.
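The dual constraint described above can be sketched in code: perturbed router embeddings (proxy tokens) are fed through every expert, yielding an n × n activation matrix whose rows and columns should each peak on the diagonal. The sketch below is a minimal illustration, not the paper's implementation; the expert architecture, the scalar "activation" score, the Gaussian perturbation, and the symmetric cross-entropy used to enforce the two constraints are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleExpert(nn.Module):
    """Minimal FFN expert (hypothetical); its hidden-layer activation
    is summarized to a scalar for the ERC-style loss below."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden)
        self.w_out = nn.Linear(d_hidden, d_model)

    def activation(self, x: torch.Tensor) -> torch.Tensor:
        # Scalar summary of internal activation strength — one possible
        # choice; the paper may define "activation" differently.
        return F.gelu(self.w_in(x)).mean(dim=-1)


def erc_loss(router_emb: torch.Tensor, experts, noise_std: float = 0.01):
    """ERC loss sketch. router_emb is (n, d): one embedding (proxy token)
    per expert. Builds A[i, j] = activation of expert i on proxy token j
    (only n^2 activations, independent of batch size), then enforces:
    (1) each row peaks on the diagonal  — expert i prefers proxy i;
    (2) each column peaks on the diagonal — proxy j prefers expert j."""
    n = router_emb.shape[0]
    # Perturbed embedding injection (Gaussian noise is an assumption).
    proxies = router_emb + noise_std * torch.randn_like(router_emb)
    # Row i: expert i's activations over all n proxy tokens.
    A = torch.stack([expert.activation(proxies) for expert in experts])
    targets = torch.arange(n)
    # Symmetric cross-entropy over rows and columns realizes the dual
    # constraint as two n-way classification problems.
    return F.cross_entropy(A, targets) + F.cross_entropy(A.t(), targets)
```

Because the loss touches only the n router embeddings rather than the tokens in a batch, its cost stays fixed as batch size grows, which is the efficiency argument the abstract makes against token-scaled coupling methods.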
Problem

Research questions and friction points this paper is trying to address.

Router decisions in MoE models lack explicit alignment with expert capabilities
Experts may not specialize effectively in the tokens actually routed to them
Prior coupling methods scale with token count, making them costly at scale
Innovation

Methods, ideas, or system contributions that make the work stand out.

ERC loss couples router decisions with expert capabilities
Perturbed router embeddings enforce mutual activation constraints
Fixed-cost auxiliary loss independent of batch size
Ang Lv
Renmin University of China
Jin Ma
ByteDance Seed, Renmin University of China, GSAI
Yiyuan Ma
ByteDance Seed
Siyuan Qiao
ByteDance Seed, Renmin University of China, GSAI