RotMoLE: Enhancing Mixture of Low-Rank Experts through Rotational Gating Mechanism

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing Mixture of Low-rank Experts (MoE-LoRA) models: their gating applies only a scalar weight to each selected expert, which constrains expert representational capacity and generalization. To overcome this bottleneck, the authors propose RotMoLE, an MoE framework for low-rank experts that adds a rotation gate to the MoE-LoRA architecture. Beyond simple scalar scaling, RotMoLE applies a rotation mechanism to each selected expert, which the authors argue improves expert exploitation and specialization. The design targets learning from diverse data, especially when the number of expert candidates is limited. Experiments on complex multi-task and multilingual training scenarios are reported to validate the effectiveness of rotation-based gating.

📝 Abstract

While Large Language Models (LLMs) are commonly fine-tuned to handle domain-specific tasks before being applied to vertical applications, adapting them to complex scenarios with diverse specialized knowledge remains challenging. Meanwhile, the Mixture-of-Experts (MoE) architecture has risen as a crucial paradigm for training LLMs, and some recent works have incorporated MoE into Parameter-Efficient Fine-Tuning (PEFT), proposing the Mixture of Low-rank Experts (MoE-LoRA) to enhance the power of low-rank adapters for learning complicated knowledge. However, conventional gating mechanisms in MoE typically apply only a scalar reweighting to selected experts, thereby limiting their capacity for representation and generalization. Motivated and enabled by the low-rank structures in MoE-LoRA, we propose RotMoLE, a specialized MoE framework for low-rank experts featuring an additional rotation gate. Beyond simple scaling, RotMoLE implements a rotation mechanism for each selected expert, enabling superior expert exploitation and specialization for learning diverse data, especially when expert candidates are limited. Empirical results on complex multi-task and multilingual training scenarios validate the effectiveness of our approach.
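
To make the contrast between scalar gating and an additional rotation gate concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not say how the rotation is parameterized or where it is applied, so this sketch assumes a block-diagonal rotation built from 2x2 planar rotations, applied in each expert's rank-r space between the LoRA down-projection and up-projection. The names `block_rotation`, `top_k_gate`, `moe_lora_delta`, `W_gate`, and `W_angle` are all hypothetical.

```python
import numpy as np

def block_rotation(angles):
    """Build a (2m x 2m) block-diagonal rotation from m planar angles.
    Assumption: this parameterization is illustrative only."""
    r = 2 * len(angles)
    R = np.zeros((r, r))
    for j, t in enumerate(angles):
        c, s = np.cos(t), np.sin(t)
        R[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, -s], [s, c]]
    return R

def top_k_gate(logits, k):
    """Pick the k highest-scoring experts and softmax over their logits."""
    idx = np.argsort(logits)[-k:]
    z = np.exp(logits[idx] - logits[idx].max())
    return idx, z / z.sum()

def moe_lora_delta(x, A, B, W_gate, k, W_angle=None):
    """Low-rank update for one token x.
    A: (E, r, d_in) down-projections, B: (E, d_out, r) up-projections.
    W_angle=None gives conventional scalar gating: sum_i g_i * B_i A_i x.
    With W_angle (E, r/2, d_in), each selected expert's rank-space
    activation is also rotated: sum_i g_i * B_i R_i(x) A_i x (hypothetical)."""
    idx, weights = top_k_gate(W_gate @ x, k)
    out = np.zeros(B.shape[1])
    for i, g in zip(idx, weights):
        h = A[i] @ x                                  # project into rank-r space
        if W_angle is not None:
            h = block_rotation(W_angle[i] @ x) @ h    # input-dependent rotation
        out += g * (B[i] @ h)                         # scalar reweighting
    return out
```

In this sketch the scalar gate can only rescale an expert's fixed rank-r direction set, while the rotation changes how the down-projected features are mixed before the up-projection, so a single expert can produce different updates for different inputs. Setting all angles to zero recovers plain scalar gating.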

Problem

Research questions and friction points this paper is trying to address.

Mixture of Experts

Low-Rank Adaptation

Gating Mechanism

Parameter-Efficient Fine-Tuning

Large Language Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Rotational Gating

Mixture of Experts

Low-Rank Adaptation