Router Upcycling: Leveraging Mixture-of-Routers in Mixture-of-Experts Upcycling

📅 2025-08-30
📈 Citations: 0
Influential citations: 0
📄 PDF
🤖 AI Summary
Problem: In Mixture-of-Experts (MoE) upcycling, simple routers such as linear routers struggle with complex routing tasks, leading to suboptimal performance. Method: The paper proposes Router Upcycling, which initializes multiple routers from the attention heads of the preceding attention layer. These routers collaboratively assign tokens to specialized experts in an attention-like manner: each token is processed into diverse queries that are matched against the experts' features, which serve as keys. Contribution/Results: This attention-grounded, multi-router collaboration improves routing quality in upcycled MoE models, and experiments across multiple benchmarks show state-of-the-art performance over other upcycling baselines.
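
The initialization step can be pictured as copying each attention head's query projection into its own small router. The sketch below is a hypothetical PyTorch illustration of that idea; the dimensions, the learnable expert keys, and all variable names are assumptions made for this example, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_experts = 512, 8, 16
d_head = d_model // n_heads

# Stand-in for the query projection of the preceding attention layer.
attn_q_proj = nn.Linear(d_model, d_model, bias=False)

# One small router per attention head, each reusing that head's query weights.
routers = nn.ModuleList([nn.Linear(d_model, d_head, bias=False) for _ in range(n_heads)])
with torch.no_grad():
    for h, router in enumerate(routers):
        # nn.Linear weight is (out_features, in_features); take head h's rows.
        router.weight.copy_(attn_q_proj.weight[h * d_head:(h + 1) * d_head, :])

# Learnable expert "keys" that the router queries will be matched against (assumed).
expert_keys = nn.Parameter(torch.randn(n_experts, d_head) / d_head ** 0.5)
```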

📝 Abstract
Mixture-of-Experts (MoE) models have gained significant attention in deep learning due to their dynamic resource allocation and superior performance across diverse tasks. However, efficiently training these models remains challenging. The MoE upcycling technique has been proposed to reuse and improve existing model components, thereby minimizing training overhead. Despite this, simple routers, such as linear routers, often struggle with complex routing tasks within MoE upcycling. In response, we propose a novel routing technique called Router Upcycling to enhance the performance of MoE upcycling models. Our approach initializes multiple routers from the attention heads of preceding attention layers during upcycling. These routers collaboratively assign tokens to specialized experts in an attention-like manner. Each token is processed into diverse queries and aligned with the experts' features (serving as keys). Experimental results demonstrate that our method achieves state-of-the-art (SOTA) performance, outperforming other upcycling baselines.
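
To make the attention-like routing concrete, here is a minimal PyTorch sketch of how multiple routers could score tokens against expert keys and pick the top-k experts. The mean-over-routers aggregation, the scaling factor, and the toy dimensions are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn

d_model, d_head, n_routers, n_experts, top_k = 512, 64, 8, 16, 2

routers = nn.ModuleList([nn.Linear(d_model, d_head, bias=False) for _ in range(n_routers)])
expert_keys = nn.Parameter(torch.randn(n_experts, d_head) / d_head ** 0.5)

def route(tokens):
    # tokens: (batch, seq, d_model). Each router turns every token into a query
    # and scores it against the expert keys, as in scaled dot-product attention.
    scores = torch.stack(
        [router(tokens) @ expert_keys.T / d_head ** 0.5 for router in routers]
    )  # (n_routers, batch, seq, n_experts)
    probs = scores.mean(dim=0).softmax(dim=-1)  # collaborative routing probabilities
    gate, expert_idx = probs.topk(top_k, dim=-1)
    return gate, expert_idx

gate, expert_idx = route(torch.randn(2, 6, d_model))
print(gate.shape, expert_idx.shape)  # torch.Size([2, 6, 2]) torch.Size([2, 6, 2])
```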
Problem

Research questions and friction points this paper is trying to address.

Enhancing the performance of MoE upcycling through improved routing
Overcoming the limitations of simple (e.g., linear) routers in upcycled MoE models
Finding a way to initialize multiple routers from existing attention heads so that they route collaboratively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Initializes multiple routers from the attention heads of the preceding attention layer
Assigns tokens to specialized experts collaboratively, in an attention-like manner
Processes each token into diverse queries that are matched against expert features serving as keys (see the sketch after this list)
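
For context, the sketch below shows a generic top-k MoE dispatch, i.e., where routing weights like those produced above would be consumed. It follows standard MoE conventions (a weighted sum of the selected experts' FFN outputs) rather than any detail specific to this paper.

```python
import torch
import torch.nn as nn

d_model, n_experts, top_k = 512, 16, 2
experts = nn.ModuleList([
    nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                  nn.Linear(4 * d_model, d_model))
    for _ in range(n_experts)
])

def moe_layer(tokens, gate, expert_idx):
    # tokens: (batch, seq, d_model); gate / expert_idx: (batch, seq, top_k)
    out = torch.zeros_like(tokens)
    for k in range(top_k):
        for e in range(n_experts):
            mask = expert_idx[..., k] == e  # tokens routed to expert e in slot k
            if mask.any():
                out[mask] += gate[..., k][mask].unsqueeze(-1) * experts[e](tokens[mask])
    return out

tokens = torch.randn(2, 6, d_model)
gate = torch.rand(2, 6, top_k)                        # stand-in routing weights
expert_idx = torch.randint(n_experts, (2, 6, top_k))  # stand-in expert choices
print(moe_layer(tokens, gate, expert_idx).shape)      # torch.Size([2, 6, 512])
```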
Authors

Junfeng Ran (National Key Laboratory for Multimedia Information Processing, Peking University)
Guangxiang Zhao (Peking University)
Yuhan Wu (Peking University)
Dawei Zhu (National Key Laboratory for Multimedia Information Processing, Peking University)
Longyun Wu (National Key Laboratory for Multimedia Information Processing, Peking University)
Yikai Zhao (Peking University)
Tong Yang (National Key Laboratory for Multimedia Information Processing, Peking University)
Lin Sun (Qihoo 360)
Xiangzheng Zhang (360)
Sujian Li (National Key Laboratory for Multimedia Information Processing, Peking University)