Self-Routing: Parameter-Free Expert Routing from Hidden States

📅 2026-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes Self-Routing, a novel mechanism for Mixture-of-Experts (MoE) architectures that eliminates the need for dedicated learnable routers. Instead of relying on additional routing parameters, Self-Routing directly derives routing logits from a designated subspace of the hidden states, preserving the rest of the MoE structure unchanged. This approach constitutes the first parameter-free, endogenous routing scheme that inherently promotes balanced expert utilization without requiring explicit load-balancing losses. Experiments on both GPT-2-scale language models and DeiT-S/16 vision transformers demonstrate that Self-Routing matches or slightly surpasses conventional router-based MoE in language modeling and ImageNet-1K classification performance, while achieving approximately 17% higher routing entropy on average, indicating improved expert diversity and utilization.
📝 Abstract
Mixture-of-Experts (MoE) layers increase model capacity by activating only a small subset of experts per token, and typically rely on a learned router to map hidden states to expert assignments. In this work, we ask whether a dedicated learned router is strictly necessary in the MoE settings we study. We propose Self-Routing, a parameter-free routing mechanism that uses a designated subspace of the token hidden state directly as expert logits, eliminating the router projection entirely while leaving the rest of the MoE layer unchanged. We evaluate Self-Routing on GPT-2-scale language modeling and ImageNet-1K classification by comparing it against a standard learned router, random-routing baselines, and dense non-MoE baselines. Our results show that Self-Routing remains competitive with the learned-router baseline while removing all dedicated routing parameters, and yields more balanced expert utilization, with about 17 % higher average normalized routing entropy and no explicit load-balancing loss. On ImageNet-1K with DeiT-S/16, Self-Routing also slightly improves over the corresponding learned-router MoE. These findings suggest that effective MoE routing can emerge from the hidden representation itself without requiring a separate learned router module.
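The routing step described in the abstract (treating a designated subspace of the token hidden state directly as expert logits, then selecting the top-k experts) can be sketched as follows. This is an illustrative reconstruction, not the authors' reference code: the function name, the choice of the leading `num_experts` dimensions as the "designated subspace", and the softmax-then-renormalize gating are assumptions for the example.

```python
import numpy as np

def self_route(hidden, num_experts, top_k=2):
    """Parameter-free routing sketch: a slice of each token's hidden
    state serves directly as expert logits (no learned router matrix).
    Assumes the designated subspace is the first `num_experts` dims."""
    logits = hidden[..., :num_experts]           # subspace as routing logits
    # numerically stable softmax over experts for gating weights
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # select the top-k experts per token
    topk_idx = np.argsort(-logits, axis=-1)[..., :top_k]
    topk_w = np.take_along_axis(probs, topk_idx, axis=-1)
    topk_w = topk_w / topk_w.sum(axis=-1, keepdims=True)  # renormalize over the k chosen
    return topk_idx, topk_w

tokens = np.random.randn(4, 64)                  # 4 tokens, hidden size 64
idx, w = self_route(tokens, num_experts=8, top_k=2)
print(idx.shape, w.shape)                        # (4, 2) (4, 2)
```

Everything downstream (expert FFNs, weighted combination of their outputs) is unchanged from a standard MoE layer; the only difference is that no router projection parameters exist.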
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
expert routing
parameter-free
hidden states
router
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Routing
Mixture-of-Experts
parameter-free routing
expert utilization
hidden state subspace
Jama Hussein Mohamud
Mila - Quebec AI Institute; Universite de Montreal
Drew Wagner
Mila - Quebec AI Institute; Concordia University
Mirco Ravanelli
Concordia University, Université de Montréal, Mila Quebec AI Institute
Deep Learning · Conversational AI · Representation Learning · Speech Processing