🤖 AI Summary
Current sparse Mixture-of-Experts (MoE) large language models suffer from poor router generalization on downstream tasks, leaving a substantial performance gap (10–20% lower accuracy) relative to optimal routing. To address this, the authors propose Routing Manifold Alignment (RoMA), a framework that unifies task understanding and expert selection via manifold alignment. RoMA adds a manifold regularization term that encourages semantically similar tasks to activate similar experts across layers, while freezing the backbone parameters and fine-tuning only the lightweight routers, thereby building bindings between tasks and experts across samples. Fine-tuning the routers of OLMoE, DeepSeekMoE, and Qwen3-MoE with RoMA consistently narrows the 10–20% accuracy gap to oracle routing and outperforms the compared baselines across diverse benchmarks.
📝 Abstract
Sparse Mixture-of-Experts (MoE) has been widely adopted in recent large language models because it efficiently scales up model capacity without increasing inference cost. However, evaluations on broad downstream tasks reveal a consistent suboptimality of the routers in existing MoE LLMs, which results in a severe performance gap (e.g., 10–20% in accuracy) relative to optimal routing. In this paper, we show that aligning the manifold of routing weights with that of task embeddings can effectively reduce this gap and improve the generalization of MoE LLMs. Our method, "Routing Manifold Alignment (RoMA)", introduces an additional manifold regularization term in the post-training objective and only requires lightweight fine-tuning of routers (with all other parameters frozen). Specifically, the regularization encourages the routing weights of each sample to be close to those of its successful neighbors (whose routing weights lead to correct answers) in a task embedding space. Consequently, samples targeting similar tasks will share similar expert choices across layers. Building such bindings between tasks and experts over different samples is essential for better generalization. Moreover, RoMA demonstrates the advantage of unifying task understanding (by embedding models) with solution generation (by MoE LLMs). In experiments, we fine-tune the routers of OLMoE, DeepSeekMoE, and Qwen3-MoE using RoMA. Evaluations on diverse benchmarks and extensive comparisons with baselines show the substantial improvement brought by RoMA.
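The regularizer described above can be sketched in a few lines: for each sample, find its nearest *successful* neighbors in the task-embedding space and penalize the distance between their routing weights. The function below is a minimal numpy illustration of that idea, not the paper's implementation; the function name, the squared-distance penalty, and the k-nearest-neighbor selection are assumptions for concreteness.

```python
import numpy as np

def roma_regularizer(routing, embeddings, success_mask, k=2):
    """Hypothetical sketch of a manifold-alignment penalty: pull each
    sample's routing weights toward those of its k nearest successful
    neighbors in the task-embedding space.

    routing:      (n, num_experts) routing weights per sample
    embeddings:   (n, d) task embeddings per sample
    success_mask: (n,) True where the sample's routing led to a correct answer
    """
    n = len(routing)
    succ_idx = np.where(success_mask)[0]
    loss = 0.0
    for i in range(n):
        # candidate neighbors: successful samples other than the sample itself
        cand = succ_idx[succ_idx != i]
        if len(cand) == 0:
            continue
        # distances in the task-embedding space (not routing space)
        d = np.linalg.norm(embeddings[cand] - embeddings[i], axis=1)
        nbrs = cand[np.argsort(d)[:k]]
        # squared-distance penalty between routing weight vectors
        loss += np.mean(np.sum((routing[nbrs] - routing[i]) ** 2, axis=1))
    return loss / n
```

In the paper's setting this term would be added to the post-training objective, with gradients flowing only into the router parameters while the rest of the backbone stays frozen; the sketch above only computes the penalty itself.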