AI Summary
This work addresses the limitations of existing low-rank Mixture-of-Experts (MoE) models, which rely on scalar-weighted gating and thereby constrain expert representational capacity and generalization. To overcome this bottleneck, we propose RotMoLE, the first approach to integrate a rotation-based gating mechanism into the MoE-LoRA architecture. Instead of applying simple scalar scaling, RotMoLE performs rotation operations on the low-rank adapters of selected experts, substantially enhancing their specialization and parameter efficiency. This design enables effective modeling of multi-domain and multilingual tasks even under strict expert budget constraints. Extensive experiments demonstrate that RotMoLE significantly outperforms current methods in complex multitask settings, validating the efficacy of rotation-based gating in improving both model performance and generalization.
Abstract
While Large Language Models (LLMs) are commonly fine-tuned on domain-specific tasks before being deployed in vertical applications, adapting them to complex scenarios that require diverse specialized knowledge remains challenging. Meanwhile, the Mixture-of-Experts (MoE) architecture has emerged as a crucial paradigm for training LLMs, and some recent works have incorporated MoE into Parameter-Efficient Fine-Tuning (PEFT), yielding the Mixture of Low-rank Experts (MoE-LoRA), which strengthens the ability of low-rank adapters to learn complicated knowledge. However, conventional gating mechanisms in MoE typically apply only a scalar reweighting to the selected experts, thereby limiting their capacity for representation and generalization. Motivated and enabled by the low-rank structures in MoE-LoRA, we propose RotMoLE, a specialized MoE framework for low-rank experts featuring an additional rotation gate. Beyond simple scaling, RotMoLE implements a rotation mechanism for each selected expert, enabling superior expert exploitation and specialization for learning diverse data, especially when expert candidates are limited. Empirical results on complex multi-task and multilingual training scenarios validate the effectiveness of our approach.
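To make the contrast between scalar gating and a rotation gate concrete, here is a minimal sketch in plain Python. The abstract does not say how RotMoLE parameterizes or places its rotation, so everything here is an illustrative assumption rather than the authors' method: the experts have rank 2, the rotation is a single planar angle `theta` applied in the rank-2 latent space between the down-projection `A` and the up-projection `B`, and the names `rotation_2d`, `expert_output`, and `moe_lora` are hypothetical.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rotation_2d(theta):
    """2x2 planar rotation matrix; orthogonal, so it preserves vector norms."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def expert_output(A, B, x, gate, theta=0.0):
    """One rank-2 low-rank expert: gate * B @ R(theta) @ A @ x.

    With theta = 0 the rotation is the identity, which recovers the
    conventional scalar-gated expert. The placement of R between A and B
    is an assumption made for this sketch.
    """
    z = matvec(A, x)                      # down-project into the rank-2 latent space
    z = matvec(rotation_2d(theta), z)     # hypothetical rotation gate
    return [gate * y for y in matvec(B, z)]  # up-project, then scalar-scale

def moe_lora(experts, x, gates, thetas):
    """Sum the gated (and optionally rotated) outputs of the selected experts."""
    out = [0.0] * len(experts[0][1])      # output dimension = number of rows of B
    for (A, B), g, t in zip(experts, gates, thetas):
        out = [o + y for o, y in zip(out, expert_output(A, B, x, g, t))]
    return out

if __name__ == "__main__":
    A = [[1.0, 0.0], [0.0, 1.0]]
    B = [[1.0, 0.0], [0.0, 1.0]]
    x = [1.0, 0.0]
    print(moe_lora([(A, B)], x, gates=[0.5], thetas=[0.0]))          # scalar gate only
    print(moe_lora([(A, B)], x, gates=[0.5], thetas=[math.pi / 2]))  # scale plus rotation
```

In this toy setup a scalar gate can only rescale an expert's update along a fixed direction, whereas the added angle lets the same expert point its update in a different direction within the span of `B`. That is one plausible reading of why a rotation gate could help when few expert candidates are available.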