AI Summary
This work addresses the limitations of existing Mixture-of-Experts (MoE) models, whose linear routing in high-dimensional raw input spaces suffers from representation mismatch, angular concentration, and sensitivity to feature scaling, leading to weak routing discriminability and unstable expert specialization. To overcome these issues, the authors propose the L2R framework, which jointly introduces a low-rank latent routing space and a Lipschitz-constrained Saturated Inner-Product Scoring (SIPS) mechanism. By employing a multi-anchor, parameter-efficient routing strategy, L2R enhances expert specialization while preserving the geometric smoothness of the routing function. The method significantly improves both routing stability and overall model performance, demonstrating consistent gains across large-scale language modeling and ImageNet vision MoE benchmarks.
Abstract
Mixture-of-Experts (MoE) models scale neural networks by conditionally activating a small subset of experts, where the router plays a central role in determining expert specialization and overall model performance. However, many modern MoE systems still adopt linear routers in raw high-dimensional representation spaces, where representation mismatch, angular concentration, and scale-sensitive scoring can jointly undermine routing discriminability and stable expert specialization. In this work, we propose Low-rank & Lipschitz-controlled Routing (L2R), a unified routing framework that reshapes both the routing space and scoring geometry. L2R performs expert assignment in a shared low-rank latent routing space and introduces Saturated Inner-Product Scoring (SIPS) to explicitly control the Lipschitz behavior of routing functions, yielding smoother and more stable routing geometry. In addition, L2R incorporates a parameter-efficient multi-anchor routing mechanism to enhance expert expressiveness. Extensive experiments on a large-scale language MoE model and a vision MoE setting on ImageNet demonstrate that L2R consistently improves routing stability, expert specialization, and overall model performance.
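The abstract names three ingredients: a shared low-rank latent routing space, a saturated (Lipschitz-bounded) inner-product score, and multiple anchors per expert. A minimal NumPy sketch of this kind of router is shown below; the exact saturation function, anchor aggregation, and gating used by L2R are not given in the abstract, so the `tanh` saturation, max-over-anchors pooling, and softmax gating here are illustrative assumptions, and all names (`W_down`, `anchors`, `route`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, n_experts, n_anchors, top_k = 64, 8, 4, 2, 2

# Shared low-rank projection into the latent routing space (assumed form).
W_down = rng.normal(scale=d_model ** -0.5, size=(d_model, d_latent))

# Multi-anchor routing: each expert owns several anchor vectors in latent space.
anchors = rng.normal(size=(n_experts, n_anchors, d_latent))

def route(x, tau=1.0):
    """Hypothetical L2R-style routing sketch, not the paper's exact equations.

    x: (batch, d_model) token representations.
    Returns top-k expert indices and gate weights per token.
    """
    z = x @ W_down                                  # low-rank latent space
    # Raw inner products with every anchor: (batch, n_experts, n_anchors).
    raw = np.einsum('bd,ead->bea', z, anchors)
    # Saturated inner-product scoring: tanh bounds each score to (-1, 1),
    # capping the score's sensitivity to feature scale (Lipschitz control).
    scores = np.tanh(raw / tau).max(axis=-1)        # best anchor per expert
    idx = np.argsort(-scores, axis=-1)[:, :top_k]   # top-k expert selection
    top = np.take_along_axis(scores, idx, axis=-1)
    gates = np.exp(top) / np.exp(top).sum(-1, keepdims=True)
    return idx, gates

x = rng.normal(size=(3, d_model))
idx, gates = route(x)
print(idx.shape, gates.shape)  # (3, 2) (3, 2)
```

Because the score is bounded before gating, rescaling the input representation can shift which expert wins but cannot blow up the gate distribution, which is one way to read the "smoother and more stable routing geometry" claim.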