Selective Sinkhorn Routing for Improved Sparse Mixture of Experts

📅 2025-11-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing sparse mixture-of-experts (SMoE) models rely on auxiliary losses or noisy gating to enforce expert load balancing, leading to objective misalignment, parameter redundancy, and inefficient training—e.g., high computational overhead from Sinkhorn iterations. This work reformulates token-to-expert assignment as an optimal transport problem with explicit load-balancing constraints—a novel formulation in SMoE. We propose Selective Sinkhorn Routing: a differentiable, end-to-end trainable gating mechanism that directly computes gating scores from the transport plan, eliminating auxiliary losses and additional learnable modules. The method ensures lightweight, efficient, and balanced expert selection without compromising differentiability. Experiments across language modeling and image classification demonstrate substantial improvements in training speed, accuracy, and robustness to input corruption compared to prior SMoE approaches.

📝 Abstract
Sparse Mixture-of-Experts (SMoE) has gained prominence as a scalable and computationally efficient architecture, enabling significant growth in model capacity without incurring additional inference costs. However, existing SMoE models often rely on auxiliary losses (e.g., z-loss, load balancing) and additional trainable parameters (e.g., noisy gating) to encourage expert diversity, leading to objective misalignment and increased model complexity. Moreover, existing Sinkhorn-based methods suffer from significant training overhead due to their heavy reliance on the computationally expensive Sinkhorn algorithm. In this work, we formulate token-to-expert assignment as an optimal transport problem, incorporating constraints to ensure balanced expert utilization. We demonstrate that introducing a minimal degree of optimal transport-based routing enhances SMoE performance without requiring auxiliary balancing losses. Unlike previous methods, our approach derives gating scores directly from the transport map, enabling more effective token-to-expert balancing, supported by both theoretical analysis and empirical results. Building on these insights, we propose Selective Sinkhorn Routing (SSR), a routing mechanism that replaces auxiliary loss with lightweight Sinkhorn-based routing. SSR promotes balanced token assignments while preserving flexibility in expert selection. Across both language modeling and image classification tasks, SSR achieves faster training, higher accuracy, and greater robustness to input corruption.
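To make the routing idea concrete, here is a minimal sketch of Sinkhorn-normalized token-to-expert assignment, where gating comes directly from the transport plan rather than from an auxiliary balancing loss. This is an illustrative reconstruction, not the paper's implementation: the function name, the entropic regularization `eps`, and the default iteration count are assumptions.

```python
import numpy as np

def sinkhorn_routing(logits, n_iters=10, eps=1.0):
    """Illustrative sketch (not the paper's SSR code): balance
    token-to-expert assignment by Sinkhorn-normalizing the router
    logits, an entropic relaxation of optimal transport.

    logits: (n_tokens, n_experts) raw router scores.
    Returns the transport plan P, whose rows sum to 1 (each token
    spends unit mass) and whose columns sum to n_tokens / n_experts
    (each expert receives an equal share), plus a top-1 assignment.
    """
    n_tokens, n_experts = logits.shape
    # Gibbs kernel: higher logit => more mass, tempered by eps.
    P = np.exp(logits / eps)
    for _ in range(n_iters):
        # Row step: each token distributes unit mass over experts.
        P /= P.sum(axis=0, keepdims=True) * n_experts / n_tokens
        # Column step: each expert receives n_tokens/n_experts mass.
        P /= P.sum(axis=1, keepdims=True)
    # Final column step so the load-balance constraint holds exactly.
    P /= P.sum(axis=0, keepdims=True) * n_experts / n_tokens
    # Gating scores are read off the plan; route top-1 per token.
    return P, P.argmax(axis=1)
```

Because the column marginals are pinned to `n_tokens / n_experts`, no expert can absorb a disproportionate share of mass, which is the load-balancing constraint the paper encodes into the transport problem itself.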
Problem

Research questions and friction points this paper is trying to address.

Improves sparse mixture of experts routing via optimal transport
Eliminates auxiliary losses and reduces model complexity
Achieves balanced expert utilization with faster training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses optimal transport for token-expert assignment
Derives gating scores directly from transport map
Replaces auxiliary loss with lightweight Sinkhorn routing
Duc Nguyen
Dickinson College
Computer Science
Huu Binh Ta
CS PhD Student at the University of Virginia
Generative Models · Trustworthy AI · Reinforcement Learning
Nhuan Le Duc
Ho Chi Minh City University of Science, Vietnam National University, Vietnam
Tan M. Nguyen
Qualcomm AI Research
Toan Tran
Qualcomm AI Research