🤖 AI Summary
Current sparse Mixture-of-Experts (MoE) large language models suffer from poor router generalization on downstream tasks, leaving a substantial performance gap (10–20% lower accuracy) relative to optimal routing. To address this, the authors propose Routing Manifold Alignment (RoMA), a framework that unifies task understanding and expert selection via manifold alignment. RoMA adds a manifold regularization term that encourages semantically similar tasks to activate similar experts across layers, while freezing the backbone parameters and fine-tuning only the lightweight routers, thereby building bindings between tasks and experts across samples. Fine-tuning the routers of OLMoE, DeepSeekMoE, and Qwen3-MoE with RoMA consistently narrows the 10–20% accuracy gap to oracle routing and outperforms the compared baselines across diverse benchmarks.
📝 Abstract
Sparse Mixture-of-Experts (MoE) has been widely adopted in recent large language models because it efficiently scales up model capacity without increasing inference cost. However, evaluations on broad downstream tasks reveal a consistent suboptimality of the routers in existing MoE LLMs, which results in a severe performance gap (e.g., 10–20% in accuracy) relative to optimal routing. In this paper, we show that aligning the manifold of routing weights with that of task embeddings can effectively reduce this gap and improve the generalization of MoE LLMs. Our method, "Routing Manifold Alignment (RoMA)", introduces an additional manifold regularization term in the post-training objective and only requires lightweight fine-tuning of routers (with all other parameters frozen). Specifically, the regularization encourages the routing weights of each sample to be close to those of its successful neighbors (whose routing weights lead to correct answers) in a task embedding space. Consequently, samples targeting similar tasks will share similar expert choices across layers. Building such bindings between tasks and experts over different samples is essential for better generalization. Moreover, RoMA demonstrates the advantage of unifying task understanding (by embedding models) with solution generation (by MoE LLMs). In experiments, we fine-tune the routers of OLMoE, DeepSeekMoE, and Qwen3-MoE using RoMA. Evaluations on diverse benchmarks and extensive comparisons with baselines show the substantial improvement brought by RoMA.
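The regularizer described above can be sketched in a few lines: for each sample, find its nearest *successful* neighbors in the task-embedding space and penalize the distance between their routing weights. The function below is a minimal numpy illustration of that idea, not the paper's implementation; the function name, the squared-distance penalty, and the k-nearest-neighbor selection are assumptions for concreteness.

```python
import numpy as np

def roma_regularizer(routing, embeddings, success_mask, k=2):
    """Hypothetical sketch of a manifold-alignment penalty: pull each
    sample's routing weights toward those of its k nearest successful
    neighbors in the task-embedding space.

    routing:      (n, num_experts) routing weights per sample
    embeddings:   (n, d) task embeddings per sample
    success_mask: (n,) True where the sample's routing led to a correct answer
    """
    n = len(routing)
    succ_idx = np.where(success_mask)[0]
    loss = 0.0
    for i in range(n):
        # candidate neighbors: successful samples other than the sample itself
        cand = succ_idx[succ_idx != i]
        if len(cand) == 0:
            continue
        # distances in the task-embedding space (not routing space)
        d = np.linalg.norm(embeddings[cand] - embeddings[i], axis=1)
        nbrs = cand[np.argsort(d)[:k]]
        # squared-distance penalty between routing weight vectors
        loss += np.mean(np.sum((routing[nbrs] - routing[i]) ** 2, axis=1))
    return loss / n
```

In the paper's setting this term would be added to the post-training objective, with gradients flowing only into the router parameters while the rest of the backbone stays frozen; the sketch above only computes the penalty itself.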