Router Upcycling: Leveraging Mixture-of-Routers in Mixture-of-Experts Upcycling

📅 2025-08-30
📈 Citations: 0
Influential citations: 0
📄 PDF
🤖 AI Summary
Problem: In Mixture-of-Experts (MoE) upcycling, simple routers such as linear routers struggle with complex routing tasks, leading to suboptimal performance. Method: The paper proposes Router Upcycling, which initializes multiple routers from the attention heads of the preceding attention layer. These routers collaboratively assign tokens to specialized experts in an attention-like manner: each token is processed into diverse queries that are matched against the experts' features, which serve as keys. Contribution/Results: This attention-grounded, multi-router collaboration improves routing quality in upcycled MoE models, and experiments across multiple benchmarks show state-of-the-art performance over other upcycling baselines.
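
The initialization step can be pictured as copying each attention head's query projection into its own small router. The sketch below is a hypothetical PyTorch illustration of that idea; the dimensions, the learnable expert keys, and all variable names are assumptions made for this example, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_experts = 512, 8, 16
d_head = d_model // n_heads

# Stand-in for the query projection of the preceding attention layer.
attn_q_proj = nn.Linear(d_model, d_model, bias=False)

# One small router per attention head, each reusing that head's query weights.
routers = nn.ModuleList([nn.Linear(d_model, d_head, bias=False) for _ in range(n_heads)])
with torch.no_grad():
    for h, router in enumerate(routers):
        # nn.Linear weight is (out_features, in_features); take head h's rows.
        router.weight.copy_(attn_q_proj.weight[h * d_head:(h + 1) * d_head, :])

# Learnable expert "keys" that the router queries will be matched against (assumed).
expert_keys = nn.Parameter(torch.randn(n_experts, d_head) / d_head ** 0.5)
```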

📝 Abstract
Mixture-of-Experts (MoE) models have gained significant attention in deep learning due to their dynamic resource allocation and superior performance across diverse tasks. However, efficiently training these models remains challenging. The MoE upcycling technique has been proposed to reuse and improve existing model components, thereby minimizing training overhead. Despite this, simple routers, such as linear routers, often struggle with complex routing tasks within MoE upcycling. In response, we propose a novel routing technique called Router Upcycling to enhance the performance of MoE upcycling models. Our approach initializes multiple routers from the attention heads of preceding attention layers during upcycling. These routers collaboratively assign tokens to specialized experts in an attention-like manner. Each token is processed into diverse queries and aligned with the experts' features (serving as keys). Experimental results demonstrate that our method achieves state-of-the-art (SOTA) performance, outperforming other upcycling baselines.
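
To make the attention-like routing concrete, here is a minimal PyTorch sketch of how multiple routers could score tokens against expert keys and pick the top-k experts. The mean-over-routers aggregation, the scaling factor, and the toy dimensions are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn

d_model, d_head, n_routers, n_experts, top_k = 512, 64, 8, 16, 2

routers = nn.ModuleList([nn.Linear(d_model, d_head, bias=False) for _ in range(n_routers)])
expert_keys = nn.Parameter(torch.randn(n_experts, d_head) / d_head ** 0.5)

def route(tokens):
    # tokens: (batch, seq, d_model). Each router turns every token into a query
    # and scores it against the expert keys, as in scaled dot-product attention.
    scores = torch.stack(
        [router(tokens) @ expert_keys.T / d_head ** 0.5 for router in routers]
    )  # (n_routers, batch, seq, n_experts)
    probs = scores.mean(dim=0).softmax(dim=-1)  # collaborative routing probabilities
    gate, expert_idx = probs.topk(top_k, dim=-1)
    return gate, expert_idx

gate, expert_idx = route(torch.randn(2, 6, d_model))
print(gate.shape, expert_idx.shape)  # torch.Size([2, 6, 2]) torch.Size([2, 6, 2])
```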
Problem

Research questions and friction points this paper is trying to address.

Enhancing the performance of MoE upcycling through improved routing
Overcoming the limitations of simple (e.g., linear) routers in upcycled MoE models
Finding a way to initialize multiple routers from existing attention heads so that they route collaboratively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Initializes multiple routers from the attention heads of the preceding attention layer
Assigns tokens to specialized experts collaboratively, in an attention-like manner
Processes each token into diverse queries that are matched against expert features serving as keys (see the sketch after this list)
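
For context, the sketch below shows a generic top-k MoE dispatch, i.e., where routing weights like those produced above would be consumed. It follows standard MoE conventions (a weighted sum of the selected experts' FFN outputs) rather than any detail specific to this paper.

```python
import torch
import torch.nn as nn

d_model, n_experts, top_k = 512, 16, 2
experts = nn.ModuleList([
    nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                  nn.Linear(4 * d_model, d_model))
    for _ in range(n_experts)
])

def moe_layer(tokens, gate, expert_idx):
    # tokens: (batch, seq, d_model); gate / expert_idx: (batch, seq, top_k)
    out = torch.zeros_like(tokens)
    for k in range(top_k):
        for e in range(n_experts):
            mask = expert_idx[..., k] == e  # tokens routed to expert e in slot k
            if mask.any():
                out[mask] += gate[..., k][mask].unsqueeze(-1) * experts[e](tokens[mask])
    return out

tokens = torch.randn(2, 6, d_model)
gate = torch.rand(2, 6, top_k)                        # stand-in routing weights
expert_idx = torch.randint(n_experts, (2, 6, top_k))  # stand-in expert choices
print(moe_layer(tokens, gate, expert_idx).shape)      # torch.Size([2, 6, 512])
```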
Authors

Junfeng Ran (National Key Laboratory for Multimedia Information Processing, Peking University)
Guangxiang Zhao (Peking University)
Yuhan Wu (Peking University)
Dawei Zhu (National Key Laboratory for Multimedia Information Processing, Peking University)
Longyun Wu (National Key Laboratory for Multimedia Information Processing, Peking University)
Yikai Zhao (Peking University)
Tong Yang (National Key Laboratory for Multimedia Information Processing, Peking University)
Lin Sun (Qihoo 360)
Xiangzheng Zhang (360)
Sujian Li (National Key Laboratory for Multimedia Information Processing, Peking University)