Optimizing MoE Routers: Design, Implementation, and Evaluation in Transformer Models

📅 2025-06-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address router-induced load imbalance and accuracy degradation in Mixture-of-Experts (MoE) models, this work designs and systematically evaluates six router variants—including the novel MLP-Hadamard—enabling, for the first time, custom router replacement and end-to-end fine-tuning on quantized Qwen1.5-MoE. MLP-Hadamard introduces a structured sparse routing mechanism that enhances expert utilization while preserving high sparsity. Empirical analysis across BERT and Qwen1.5-MoE reveals that Linear routers incur the lowest latency, MLP and Attention routers offer superior expressivity, and MLP-Hadamard achieves the optimal trade-off among inference efficiency, load balancing, and parameter efficiency. This study establishes a reproducible benchmarking framework for MoE router design and delivers actionable insights for deploying efficient, production-ready MoE systems.

📝 Abstract
Mixture of Experts (MoE) architectures increase large language model scalability, yet their performance depends on the router module that directs tokens to specialized experts. Poor routing can cause load imbalance and reduced accuracy. This project designed and implemented different router architectures within Transformer models to address these limitations. We experimented with six distinct router variants: Linear, Attention, Multi-Layer Perceptron (MLP), Hybrid, Hash, and our new MLP-Hadamard. We characterized these routers using BERT and the Qwen1.5-MoE model, examining parameter efficiency, inference latency, routing entropy, and expert utilization patterns. Our evaluations showed distinct trade-offs: Linear routers offer speed, while MLP and Attention routers provide greater expressiveness. The MLP-Hadamard router shows a unique capability for structured, sparse routing. We successfully replaced and fine-tuned custom routers within the complex, quantized Qwen1.5-MoE model. This work provides a comparative analysis of MoE router designs and offers insights into optimizing their performance for efficient and effective large-scale model deployment.
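For orientation, the sketch below shows what a standard top-k Linear router typically looks like in PyTorch: a single projection produces per-expert logits, and each token is dispatched to its top-k experts with renormalized mixing weights. This is a generic illustration, not the paper's code; the hidden size, expert count, and top-k value are placeholder values.

```python
# Minimal sketch of a top-k Linear MoE router (illustrative; not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearRouter(nn.Module):
    def __init__(self, hidden_dim: int = 768, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)  # single gating projection
        self.top_k = top_k

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (num_tokens, hidden_dim)
        logits = self.gate(hidden_states)                    # (num_tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        top_probs, top_experts = probs.topk(self.top_k, dim=-1)
        # Renormalize the selected experts' weights so they sum to 1 per token.
        top_probs = top_probs / top_probs.sum(dim=-1, keepdim=True)
        return top_experts, top_probs                        # expert indices and mixing weights
```

The Attention, MLP, Hybrid, Hash, and MLP-Hadamard variants studied in the paper replace or augment this gating function while keeping the same dispatch interface.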
Problem

Research questions and friction points this paper is trying to address.

Optimizing router modules in MoE architectures to improve performance
Addressing load imbalance and accuracy issues in token routing
Evaluating router designs for efficient large-scale model deployment (routing metrics sketched below)
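The abstract names routing entropy and expert utilization as the load-balance diagnostics. A common way to compute them from router outputs is sketched below; the paper's exact metric formulations may differ.

```python
# Common definitions of routing entropy and expert utilization (assumed formulations).
import torch

def routing_entropy(probs: torch.Tensor) -> torch.Tensor:
    # Mean per-token entropy of the routing distribution over experts
    # (higher entropy = more uniform, less collapsed routing).
    return -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()

def expert_utilization(top_experts: torch.Tensor, num_experts: int) -> torch.Tensor:
    # Fraction of routed token slots assigned to each expert; ideal is 1 / num_experts.
    counts = torch.bincount(top_experts.flatten(), minlength=num_experts).float()
    return counts / counts.sum()
```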
Innovation

Methods, ideas, or system contributions that make the work stand out.

Designed six distinct MoE router variants
Evaluated routers using BERT and Qwen1.5-MoE
Introduced MLP-Hadamard for structured sparse routing (see the sketch below)
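The summary does not spell out the MLP-Hadamard formulation. One plausible reading, shown here purely as a hypothetical illustration, is an MLP gate whose logits are modulated element-wise (a Hadamard product) by a second learned projection, which structures and sparsifies the routing scores.

```python
# Hypothetical sketch of an "MLP-Hadamard"-style router; the element-wise gating
# shown here is an assumption for illustration, not the paper's stated design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPHadamardRouter(nn.Module):
    def __init__(self, hidden_dim: int = 768, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.mlp = nn.Sequential(              # expressive MLP gate over token states
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_experts),
        )
        self.mask_proj = nn.Linear(hidden_dim, num_experts)  # per-token modulation gate
        self.top_k = top_k

    def forward(self, hidden_states: torch.Tensor):
        logits = self.mlp(hidden_states)                       # (num_tokens, num_experts)
        gate = torch.sigmoid(self.mask_proj(hidden_states))
        probs = F.softmax(logits * gate, dim=-1)               # Hadamard-modulated scores
        top_probs, top_experts = probs.topk(self.top_k, dim=-1)
        return top_experts, top_probs / top_probs.sum(dim=-1, keepdim=True)
```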
Daniel Fidel Harvey
George Weale
Berk Yilmaz
Columbia University
AI · Machine Learning