AI Summary
This work addresses the limitations of existing Mixture-of-Experts (MoE) models, whose linear routing in high-dimensional raw input spaces suffers from representation mismatch, angular concentration, and sensitivity to feature scaling, leading to weak routing discriminability and unstable expert specialization. To overcome these issues, the authors propose the L2R framework, which jointly introduces a low-rank latent routing space and a Lipschitz-constrained Saturated Inner-Product Scoring (SIPS) mechanism. By employing a multi-anchor, parameter-efficient routing strategy, L2R enhances expert specialization while preserving the geometric smoothness of the routing function. The method significantly improves both routing stability and overall model performance, demonstrating consistent gains across large-scale language modeling and ImageNet vision MoE benchmarks.
Abstract
Mixture-of-Experts (MoE) models scale neural networks by conditionally activating a small subset of experts, where the router plays a central role in determining expert specialization and overall model performance. However, many modern MoE systems still adopt linear routers in raw high-dimensional representation spaces, where representation mismatch, angular concentration, and scale-sensitive scoring can jointly undermine routing discriminability and stable expert specialization. In this work, we propose Low-rank & Lipschitz-controlled Routing (L2R), a unified routing framework that reshapes both the routing space and scoring geometry. L2R performs expert assignment in a shared low-rank latent routing space and introduces Saturated Inner-Product Scoring (SIPS) to explicitly control the Lipschitz behavior of routing functions, yielding smoother and more stable routing geometry. In addition, L2R incorporates a parameter-efficient multi-anchor routing mechanism to enhance expert expressiveness. Extensive experiments on a large-scale language MoE model and a vision MoE setting on ImageNet demonstrate that L2R consistently improves routing stability, expert specialization, and overall model performance.
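The abstract names three ingredients: a shared low-rank latent routing space, a saturated (Lipschitz-bounded) inner-product score, and multiple anchors per expert. A minimal NumPy sketch of this kind of router is shown below; the exact saturation function, anchor aggregation, and gating used by L2R are not given in the abstract, so the `tanh` saturation, max-over-anchors pooling, and softmax gating here are illustrative assumptions, and all names (`W_down`, `anchors`, `route`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, n_experts, n_anchors, top_k = 64, 8, 4, 2, 2

# Shared low-rank projection into the latent routing space (assumed form).
W_down = rng.normal(scale=d_model ** -0.5, size=(d_model, d_latent))

# Multi-anchor routing: each expert owns several anchor vectors in latent space.
anchors = rng.normal(size=(n_experts, n_anchors, d_latent))

def route(x, tau=1.0):
    """Hypothetical L2R-style routing sketch, not the paper's exact equations.

    x: (batch, d_model) token representations.
    Returns top-k expert indices and gate weights per token.
    """
    z = x @ W_down                                  # low-rank latent space
    # Raw inner products with every anchor: (batch, n_experts, n_anchors).
    raw = np.einsum('bd,ead->bea', z, anchors)
    # Saturated inner-product scoring: tanh bounds each score to (-1, 1),
    # capping the score's sensitivity to feature scale (Lipschitz control).
    scores = np.tanh(raw / tau).max(axis=-1)        # best anchor per expert
    idx = np.argsort(-scores, axis=-1)[:, :top_k]   # top-k expert selection
    top = np.take_along_axis(scores, idx, axis=-1)
    gates = np.exp(top) / np.exp(top).sum(-1, keepdims=True)
    return idx, gates

x = rng.normal(size=(3, d_model))
idx, gates = route(x)
print(idx.shape, gates.shape)  # (3, 2) (3, 2)
```

Because the score is bounded before gating, rescaling the input representation can shift which expert wins but cannot blow up the gate distribution, which is one way to read the "smoother and more stable routing geometry" claim.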