Grassmannian Mixture-of-Experts: Concentration-Controlled Routing on Subspace Manifolds

📅 2026-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of controlling the trade-off between routing sparsity and expert utilization in Mixture-of-Experts (MoE) models. We propose the first continuous sparse routing framework governed by a concentration parameter Λ: routing is modeled on the Grassmann manifold, and gating weights are derived from the concentration parameter of the Matrix Bingham distribution, so a single tunable hyperparameter smoothly controls routing entropy. This eliminates the need for discrete top-k selection and prevents expert collapse. We further integrate variational inference to enable uncertainty-aware expert assignment. Experiments demonstrate that our method achieves 0% expert collapse across MoE models of varying scales, matches or improves upon baseline perplexity, improves load balancing by 15–30%, and supports post-hoc sparsity adjustment without retraining.
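The gating construction described above can be sketched in miniature. Everything below is an illustrative assumption, not the paper's implementation: `grassmann_gate`, the one-dimensional expert subspaces, and the scalar concentration `lam` are hypothetical stand-ins, using a Bingham-style log-density λ(uᵀx)² on the unit sphere as the unnormalized gate logit so that λ → 0 gives uniform routing and large λ concentrates mass on a few experts.

```python
import numpy as np

def grassmann_gate(x, experts, lam):
    """Hypothetical sketch of Bingham-style gating.

    Each expert owns a unit direction u_e (a 1-D subspace); the
    unnormalized log-gate is lam * (u_e . x_hat)^2, i.e. the Bingham
    log-density up to a constant. lam controls routing entropy:
    lam = 0 yields uniform gates, large lam sharpens them."""
    x_hat = x / np.linalg.norm(x)
    U = experts / np.linalg.norm(experts, axis=1, keepdims=True)
    logits = lam * (U @ x_hat) ** 2
    w = np.exp(logits - logits.max())   # stable softmax
    return w / w.sum()

def entropy(p):
    # small epsilon guards against log(0) for very sharp gates
    return -(p * np.log(p + 1e-12)).sum()

rng = np.random.default_rng(0)
x = rng.normal(size=16)        # a token representation (assumed)
E = rng.normal(size=(8, 16))   # 8 expert directions (assumed)

for lam in (0.0, 5.0, 50.0):
    print(f"lam={lam:5.1f}  entropy={entropy(grassmann_gate(x, E, lam)):.3f}")
```

The single knob λ replaces discrete top-k selection: sparsity emerges continuously from the concentration rather than from a hard cutoff.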

📝 Abstract
Mixture-of-Experts models rely on learned routers to assign tokens to experts, yet standard softmax gating provides no principled mechanism to control the trade-off between sparsity and utilization. We propose Grassmannian MoE (GrMoE), a routing framework that operates on the Grassmannian manifold of subspaces, where gating weights arise from the concentration parameters of Matrix Bingham distributions. This construction yields a single, interpretable knob, the concentration matrix Λ, that continuously controls routing entropy, replacing discrete top-k selection with a smooth, geometrically principled sparsity mechanism. We further develop an amortized variational inference procedure for posterior routing distributions, enabling uncertainty-aware expert assignment that naturally resists expert collapse. We formally prove tight bounds relating the Bingham concentration spectrum to routing entropy, expected top-k mass, and an exponential bound on expert collapse, establishing the first formal theory of concentration-controlled sparsity. On synthetic routing tasks, a 350M-parameter MoE language model with 8 experts, a 1.3B-parameter model with 16 experts, and a 2.7B-parameter model with 32 experts, GrMoE achieves 0% routing collapse across all seeds, comparable or better perplexity with 15–30% improved load balance, and a smooth monotonic relationship between concentration and effective sparsity that enables post-hoc sparsity tuning without retraining. Token-level analysis reveals that experts learn heterogeneous concentration values that correlate with linguistic specialization, providing interpretable routing behavior.
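The claimed monotone link between concentration and effective sparsity can be illustrated with a rough sketch. All names and the squared-cosine score model here are assumptions, not the paper's method: fixed per-expert scores are tilted by a scalar concentration `lam`, and mean routing entropy and expected top-2 mass are measured over a batch of tokens. Since no parameters change between sweep points, this mirrors post-hoc sparsity tuning without retraining.

```python
import numpy as np

# Assumed setup: 100 tokens, 16 experts, fixed scores in [0, 1]
# standing in for squared cosines between tokens and expert subspaces.
rng = np.random.default_rng(1)
scores = rng.uniform(size=(100, 16))

def sweep(lam, k=2):
    """Gate each token via softmax(lam * scores); return mean routing
    entropy and mean top-k gate mass across the batch."""
    logits = lam * scores
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    ent = -(w * np.log(w)).sum(axis=1).mean()
    topk = np.sort(w, axis=1)[:, -k:].sum(axis=1).mean()
    return ent, topk

for lam in (0.0, 2.0, 8.0, 32.0):
    ent, topk = sweep(lam)
    print(f"lam={lam:5.1f}  entropy={ent:.3f}  top-2 mass={topk:.3f}")
```

Raising `lam` lowers entropy and raises top-k mass, the qualitative behavior the abstract attributes to the Bingham concentration spectrum; the real construction operates on subspaces of the Grassmannian rather than on fixed scalar scores.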
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
routing sparsity
expert collapse
load balancing
Grassmannian manifold
Innovation

Methods, ideas, or system contributions that make the work stand out.

Grassmannian manifold
Mixture-of-Experts
concentration-controlled routing
Matrix Bingham distribution
expert collapse prevention