Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

📅 2026-05-12

🤖 AI Summary
This work addresses two difficulties in training sparse mixture-of-experts (SMoE) models: routing can collapse onto a few experts, and auxiliary load-balancing losses can reduce expert specialization. The study reveals a geometric coupling between routers and their corresponding experts: for a given token, the router weights for the selected expert and the weights of the expert that processes it receive gradients along the same input direction, differing only in scalar coefficients, so matched router-expert directions accumulate the same routed-token history. In a 1B SMoE trained from scratch, higher router scores predict stronger expert neuron activations. Auxiliary load-balancing losses break this structure, making distinct router directions nearly three times more similar to each other. Building on this insight, the authors propose a parameter-free online K-Means router in which each expert keeps a running average of the hidden states routed to it and tokens are assigned by cosine similarity. Compared with auxiliary-loss and loss-free balancing, this router achieves the lowest load imbalance with only a modest perplexity increase, indicating that geometric coupling captures a substantial part of what the router learns.
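The shared gradient direction mentioned in the summary can be illustrated with a short chain-rule sketch. The setup here is an assumption made for illustration and is not taken from the paper: a linear router whose score for expert $e$ is $s_e = r_e^\top x$, and an expert whose first layer computes pre-activations $h_{e,j} = w_{e,j}^\top x$, both applied to the same token hidden state $x$ with training loss $\mathcal{L}$. The symbols $r_e$, $w_{e,j}$, $s_e$, and $h_{e,j}$ are hypothetical names.

```latex
\begin{align}
  s_e &= r_e^\top x,
  & \frac{\partial \mathcal{L}}{\partial r_e}
  &= \frac{\partial \mathcal{L}}{\partial s_e}\, x, \\
  h_{e,j} &= w_{e,j}^\top x,
  & \frac{\partial \mathcal{L}}{\partial w_{e,j}}
  &= \frac{\partial \mathcal{L}}{\partial h_{e,j}}\, x.
\end{align}
```

Under these assumptions both gradients are scalar multiples of the same vector $x$: the scalars $\partial \mathcal{L}/\partial s_e$ and $\partial \mathcal{L}/\partial h_{e,j}$ differ, but the direction of each update is the token itself. Summed over training, the router row and the expert rows therefore accumulate weighted combinations of the tokens routed to that expert.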
📝 Abstract
Sparse Mixture-of-Experts (SMoE) models enable scaling language models efficiently, but training them remains challenging, as routing can collapse onto few experts and auxiliary load-balancing losses can reduce specialization. Motivated by these hurdles, we study how routing decisions in SMoEs are formed mechanistically. First, we reveal a geometric coupling between routers and their corresponding experts. For a given token, the router weights for the selected expert and the expert weights processing it receive gradients along the same input direction, differing only in scalar coefficients. Thus, matched router-expert directions accumulate the same routed token history. This theoretical coupling also appears empirically in routing dynamics. In a 1B SMoE trained from scratch, higher router scores predict stronger expert neuron activations, showing that routing decisions are mirrored inside the selected expert. Next, we analyze the effects of auxiliary load balancing on the router-expert geometric coupling, showing that such losses break this structure by spreading input-directed gradients across router weights, making distinct router directions nearly three times more similar to each other. Last, we demonstrate the centrality of geometric coupling for effective routing with a parameter-free online K-Means router, in which each expert maintains a running average of the hidden states routed to it and tokens are assigned based on cosine similarity. Compared with auxiliary-loss and loss-free balancing, this router achieves the lowest load imbalance with only a modest perplexity increase, indicating that geometric coupling captures a substantial part of what the router learns. Overall, our results explain how routers form assignment geometry that supports an effective division of labor.
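The online K-Means router described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `OnlineKMeansRouter`, the random centroid initialization, the `top_k` parameter, and the use of a cumulative (rather than exponentially weighted) running mean are all hypothetical choices. The abstract only states that each expert keeps a running average of the hidden states routed to it and that tokens are assigned by cosine similarity.

```python
import numpy as np


class OnlineKMeansRouter:
    """Illustrative router: one centroid per expert, no trained parameters."""

    def __init__(self, num_experts, dim, top_k=1, seed=0):
        rng = np.random.default_rng(seed)
        # Assumption: random initial centroids; the first routed batch
        # overwrites each one because the counts start at zero.
        self.centroids = rng.standard_normal((num_experts, dim))
        self.counts = np.zeros(num_experts, dtype=np.int64)
        self.top_k = top_k

    def route(self, hidden):
        """Return (assignment, sims) for hidden states of shape (tokens, dim)."""
        eps = 1e-8
        h = hidden / (np.linalg.norm(hidden, axis=1, keepdims=True) + eps)
        c = self.centroids / (
            np.linalg.norm(self.centroids, axis=1, keepdims=True) + eps
        )
        sims = h @ c.T  # cosine similarity, shape (tokens, experts)
        assignment = np.argsort(-sims, axis=1)[:, : self.top_k]
        return assignment, sims

    def update(self, hidden, assignment):
        """Fold the routed hidden states into each expert's running mean."""
        for e in range(self.centroids.shape[0]):
            routed = hidden[(assignment == e).any(axis=1)]
            n = routed.shape[0]
            if n == 0:
                continue
            self.counts[e] += n
            # Cumulative mean: new = old + (sum_new - n * old) / total_count
            self.centroids[e] += (
                routed.sum(axis=0) - n * self.centroids[e]
            ) / self.counts[e]


if __name__ == "__main__":
    router = OnlineKMeansRouter(num_experts=4, dim=16, top_k=1)
    tokens = np.random.default_rng(1).standard_normal((32, 16))
    assignment, _ = router.route(tokens)
    router.update(tokens, assignment)
    # Tokens per expert after one step (values depend on the random data).
    print(np.bincount(assignment.ravel(), minlength=4))
```

Because the centroids are plain averages of routed hidden states, nothing in this sketch is trained by gradient descent, which matches the "parameter-free" description. Any balancing behavior would have to come from the geometry of the centroids themselves; the sketch does not add a capacity limit or bias term.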
Problem

Research questions and friction points this paper is trying to address.

Sparse Mixture-of-Experts
routing collapse
load balancing
expert specialization
geometric coupling
Innovation

Methods, ideas, or system contributions that make the work stand out.

geometric coupling
sparse mixture-of-experts
router-expert alignment
online K-Means routing
load balancing