🤖 AI Summary
In Mixture-of-Experts (MoE) models, simple routers such as linear routers struggle with complex routing tasks during upcycling, leading to suboptimal performance. Method: This paper proposes Router Upcycling, a novel approach that repurposes multiple attention heads from a preceding attention layer to initialize a set of dedicated, cooperative routers. These routers jointly perform fine-grained, feature-aligned token-to-expert assignment by explicitly modeling semantic matching between token queries and expert representations via the attention mechanism. Contribution/Results: Router Upcycling introduces, for the first time, a multi-router collaboration framework grounded in the attention structure, significantly improving routing accuracy and model expressivity. Experiments demonstrate state-of-the-art performance across multiple benchmarks with substantially reduced training overhead, and without introducing additional parameters or computational cost.
📝 Abstract
Mixture-of-Experts (MoE) models have gained significant attention in deep learning due to their dynamic resource allocation and superior performance across diverse tasks. However, efficiently training these models remains challenging. The MoE upcycling technique has been proposed to reuse and improve existing model components, thereby minimizing training overhead. Despite this, simple routers, such as linear routers, often struggle with the complex routing tasks that arise in MoE upcycling. In response, we propose a novel routing technique called Router Upcycling to enhance the performance of MoE upcycling models. Our approach initializes multiple routers from the attention heads of the preceding attention layers during upcycling. These routers collaboratively assign tokens to specialized experts in an attention-like manner: each token is projected into diverse queries and matched against the experts' features, which serve as keys. Experimental results demonstrate that our method achieves state-of-the-art (SOTA) performance, outperforming other upcycling baselines.
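To make the attention-like routing concrete, here is a minimal NumPy sketch of the idea described above: several router "heads" each project a token into a query, score it against per-expert key vectors, and the heads' scores are combined before top-k expert selection. All names, shapes, and the score-averaging rule are illustrative assumptions, not the paper's exact formulation (in particular, the paper initializes the query projections from a preceding attention layer's heads, which random initialization here only stands in for).

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_heads, n_experts, top_k = 64, 4, 8, 2
d_head = d_model // n_heads

# Hypothetical per-router query projections; the paper would initialize
# these from the preceding attention layer's head projections.
W_q = rng.standard_normal((n_heads, d_model, d_head)) * 0.02
# Hypothetical expert key vectors (one key per expert, per router head).
expert_keys = rng.standard_normal((n_heads, n_experts, d_head)) * 0.02

def route(token: np.ndarray):
    """Attention-like routing: each router head scores all experts for the
    token; head scores are averaged, then the top-k experts are selected."""
    scores = np.zeros(n_experts)
    for h in range(n_heads):
        q = token @ W_q[h]                          # query for head h
        scores += expert_keys[h] @ q / np.sqrt(d_head)
    scores /= n_heads
    probs = np.exp(scores - scores.max())           # softmax over experts
    probs /= probs.sum()
    top = np.argsort(probs)[-top_k:][::-1]          # top-k expert indices
    return top, probs[top] / probs[top].sum()       # renormalized gates

token = rng.standard_normal(d_model)
experts, gates = route(token)
```

The selected experts would then process the token and their outputs would be mixed with the returned gate weights, as in a standard MoE layer.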