🤖 AI Summary
Conventional Mixture-of-Experts (MoE) models universally employ Softmax for router weight normalization, implicitly enforcing projection onto the probability simplex—an assumption hitherto unchallenged. Method: This work establishes, for the first time, the mathematical equivalence between MoE routing and Nadaraya–Watson kernel regression. Leveraging this insight, we propose KERN: a kernel-inspired router implemented via a feed-forward network (FFN), replacing Softmax with ReLU activation and ℓ₂ normalization—thereby eliminating probabilistic normalization constraints without additional computational overhead. This framework unifies and generalizes both Sigmoid- and Softmax-based routing. Contribution/Results: Extensive experiments across diverse MoE architectures and large language models demonstrate that KERN matches or surpasses Softmax in performance while incurring zero inference latency overhead and exhibiting improved training stability.
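The claimed equivalence can be made concrete with the standard definitions (a sketch; the gate-score notation $s_e$ is my own, not taken from the paper):

```latex
% Nadaraya–Watson kernel regression: a kernel-weighted average of targets y_i
\hat{m}(x) = \frac{\sum_{i} K(x, x_i)\, y_i}{\sum_{i} K(x, x_i)}

% Softmax-routed MoE: a kernel-weighted average of expert outputs E_e(x),
% with an exponential "kernel" over router scores s_e(x)
y(x) = \sum_{e} \frac{\exp\!\big(s_e(x)\big)}{\sum_{e'} \exp\!\big(s_{e'}(x)\big)}\, E_e(x)
```

Both expressions are normalized kernel-weighted averages; swapping the exponential kernel and sum-normalization for a different activation and normalization yields the generalized router family the paper studies.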
📝 Abstract
Mixture-of-Experts (MoE) has become a cornerstone in recent state-of-the-art large language models (LLMs). Traditionally, MoE relies on $\mathrm{Softmax}$ as the router score function to aggregate expert outputs, a design choice that has persisted from the earliest MoE models to modern LLMs and is now widely regarded as standard practice. However, the necessity of using $\mathrm{Softmax}$ to project router weights onto a probability simplex remains an unchallenged assumption rather than a principled design choice. In this work, we first revisit classical Nadaraya-Watson regression and observe that MoE shares the same mathematical formulation as Nadaraya-Watson regression. Furthermore, we show that both the feed-forward neural network (FFN) and MoE can be interpreted as special cases of Nadaraya-Watson regression, where the kernel function corresponds to the input neurons of the output layer. Motivated by these insights, we propose the **zero-additional-cost** Kernel Inspired Router with Normalization (KERN), an FFN-style router function, as an alternative to $\mathrm{Softmax}$. We demonstrate that this router generalizes both $\mathrm{Sigmoid}$- and $\mathrm{Softmax}$-based routers. **Based on empirical observations and established practices in FFN implementation, we recommend the use of $\mathrm{ReLU}$ activation and $\ell_2$-normalization in the $\mathrm{KERN}$ router function.** Comprehensive experiments on MoE and LLM benchmarks validate the effectiveness of the proposed FFN-style router function KERN.
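The abstract's description of KERN (ReLU activation followed by $\ell_2$-normalization in place of $\mathrm{Softmax}$) can be sketched in NumPy. This is a minimal illustration inferred from the abstract, not the authors' reference implementation; the `eps` guard against an all-zero ReLU output is my own assumption.

```python
import numpy as np

def softmax_router(scores):
    # Conventional router: projects scores onto the probability simplex,
    # so weights are nonnegative and sum to 1.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kern_router(scores, eps=1e-8):
    # KERN-style router as described in the abstract: ReLU activation
    # followed by l2-normalization, dropping the simplex constraint.
    # Negative scores map to exact zeros, so some experts get weight 0.
    a = np.maximum(scores, 0.0)  # ReLU
    return a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)

scores = np.array([2.0, -1.0, 0.5, 1.0])  # router logits for 4 experts
w_soft = softmax_router(scores)  # nonnegative, sums to 1
w_kern = kern_router(scores)     # nonnegative, unit l2 norm, exact zeros possible
```

Note the qualitative difference: $\mathrm{Softmax}$ always assigns every expert a strictly positive weight, whereas the ReLU in KERN can zero out experts with negative scores while the $\ell_2$-normalization still keeps the weight vector at a fixed scale.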