🤖 AI Summary
Conventional Mixture-of-Experts (MoE) models universally employ Softmax for router weight normalization, implicitly enforcing projection onto the probability simplex—an assumption hitherto unchallenged. Method: This work establishes, for the first time, the mathematical equivalence between MoE routing and Nadaraya–Watson kernel regression. Leveraging this insight, we propose KERN: a kernel-inspired router implemented via a feed-forward network (FFN), replacing Softmax with ReLU activation and ℓ₂ normalization—thereby eliminating probabilistic normalization constraints without additional computational overhead. This framework unifies and generalizes both Sigmoid- and Softmax-based routing. Contribution/Results: Extensive experiments across diverse MoE architectures and large language models demonstrate that KERN matches or surpasses Softmax in performance while incurring zero inference latency overhead and exhibiting improved training stability.
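The claimed equivalence can be made concrete with the standard definitions (a sketch; the gate-score notation $s_e$ is my own, not taken from the paper):

```latex
% Nadaraya–Watson kernel regression: a kernel-weighted average of targets y_i
\hat{m}(x) = \frac{\sum_{i} K(x, x_i)\, y_i}{\sum_{i} K(x, x_i)}

% Softmax-routed MoE: a kernel-weighted average of expert outputs E_e(x),
% with an exponential "kernel" over router scores s_e(x)
y(x) = \sum_{e} \frac{\exp\!\big(s_e(x)\big)}{\sum_{e'} \exp\!\big(s_{e'}(x)\big)}\, E_e(x)
```

Both expressions are normalized kernel-weighted averages; swapping the exponential kernel and sum-normalization for a different activation and normalization yields the generalized router family the paper studies.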
📝 Abstract
Mixture-of-Experts (MoE) has become a cornerstone in recent state-of-the-art large language models (LLMs). Traditionally, MoE relies on $\mathrm{Softmax}$ as the router score function to aggregate expert outputs, a design choice that has persisted from the earliest MoE models to modern LLMs and is now widely regarded as standard practice. However, the necessity of using $\mathrm{Softmax}$ to project router weights onto a probability simplex remains an unchallenged assumption rather than a principled design choice. In this work, we first revisit classical Nadaraya-Watson regression and observe that MoE shares the same mathematical formulation as Nadaraya-Watson regression. Furthermore, we show that both the feed-forward neural network (FFN) and MoE can be interpreted as special cases of Nadaraya-Watson regression, where the kernel function corresponds to the input neurons of the output layer. Motivated by these insights, we propose the **zero-additional-cost** Kernel Inspired Router with Normalization (KERN), an FFN-style router function, as an alternative to $\mathrm{Softmax}$. We demonstrate that this router generalizes both $\mathrm{Sigmoid}$- and $\mathrm{Softmax}$-based routers. **Based on empirical observations and established practices in FFN implementation, we recommend the use of $\mathrm{ReLU}$ activation and $\ell_2$-normalization in the $\mathrm{KERN}$ router function.** Comprehensive experiments on MoE and LLM benchmarks validate the effectiveness of the proposed FFN-style router function KERN.
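The abstract's description of KERN (ReLU activation followed by $\ell_2$-normalization in place of $\mathrm{Softmax}$) can be sketched in NumPy. This is a minimal illustration inferred from the abstract, not the authors' reference implementation; the `eps` guard against an all-zero ReLU output is my own assumption.

```python
import numpy as np

def softmax_router(scores):
    # Conventional router: projects scores onto the probability simplex,
    # so weights are nonnegative and sum to 1.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kern_router(scores, eps=1e-8):
    # KERN-style router as described in the abstract: ReLU activation
    # followed by l2-normalization, dropping the simplex constraint.
    # Negative scores map to exact zeros, so some experts get weight 0.
    a = np.maximum(scores, 0.0)  # ReLU
    return a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)

scores = np.array([2.0, -1.0, 0.5, 1.0])  # router logits for 4 experts
w_soft = softmax_router(scores)  # nonnegative, sums to 1
w_kern = kern_router(scores)     # nonnegative, unit l2 norm, exact zeros possible
```

Note the qualitative difference: $\mathrm{Softmax}$ always assigns every expert a strictly positive weight, whereas the ReLU in KERN can zero out experts with negative scores while the $\ell_2$-normalization still keeps the weight vector at a fixed scale.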