OMoE: Diversifying Mixture of Low-Rank Adaptation by Orthogonal Finetuning

📅 2025-01-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In LoRA-based Mixture-of-Experts (MoE) models, expert representation collapse and insufficient diversity lead to performance saturation. Method: This paper proposes a Stiefel-manifold-based orthogonalization training mechanism—introducing Stiefel constraints into MoE architectures for the first time—and explicitly enhances expert distinctness via Gram–Schmidt orthogonalization, requiring no additional hyperparameters and preserving the original optimization objective. The approach maintains low parameter overhead and sparse gating while improving expert representation orthogonality and inference stability. Results: Experiments on multiple commonsense reasoning benchmarks demonstrate that our method achieves state-of-the-art (SOTA) performance using only ~40% of the experts required by prior approaches, delivering higher accuracy, stronger robustness, and superior parameter efficiency in lightweight fine-tuning.
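The core mechanism described in the summary, orthogonalizing the experts' representations with a Gram-Schmidt pass so that the stacked vectors form an orthonormal frame (a point on the Stiefel manifold), can be illustrated with a minimal PyTorch sketch. This is an illustration only, not the authors' code; the function name `gram_schmidt`, the tensor shapes, and the epsilon guard are assumptions.

```python
# Minimal sketch (not the authors' implementation) of the Gram-Schmidt step:
# orthonormalize a stack of expert representations so they lie on the Stiefel manifold.
import torch

def gram_schmidt(expert_reps: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Orthonormalize expert representations.

    expert_reps: tensor of shape (num_experts, dim), one representation per expert.
    Returns a tensor of the same shape whose rows are mutually orthogonal and unit-norm.
    """
    ortho = []
    for v in expert_reps:
        u = v.clone()
        # Subtract the projection onto every previously accepted direction.
        for q in ortho:
            u = u - (u @ q) * q
        # Normalize; eps guards against near-zero (collapsed) vectors.
        u = u / (u.norm() + eps)
        ortho.append(u)
    return torch.stack(ortho, dim=0)

# Example: four experts with 16-dimensional representations.
reps = torch.randn(4, 16)
ortho_reps = gram_schmidt(reps)
print(torch.allclose(ortho_reps @ ortho_reps.T, torch.eye(4), atol=1e-5))  # True
```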

📝 Abstract
Building a mixture-of-experts (MoE) architecture for low-rank adaptation (LoRA) is emerging as a potential direction in parameter-efficient fine-tuning (PEFT) for its modular design and remarkable performance. However, simply stacking the number of experts cannot guarantee significant improvement. In this work, we first conduct qualitative analysis to indicate that experts collapse to similar representations in vanilla MoE, limiting the capacity of modular design and computational efficiency. Further, our analysis reveals that the performance of previous MoE variants may be limited by a lack of diversity among experts. Motivated by these findings, we propose Orthogonal Mixture-of-Experts (OMoE), a resource-efficient MoE variant that trains experts in an orthogonal manner to promote diversity. In OMoE, a Gram-Schmidt process is leveraged to enforce that the experts' representations lie within the Stiefel manifold. By applying orthogonal constraints directly to the architecture, OMoE keeps the learning objective unchanged, without compromising optimality. Our method is simple and alleviates memory bottlenecks, as it requires minimal experts compared with vanilla MoE models. Experiments on diverse commonsense reasoning benchmarks demonstrate that OMoE consistently achieves stable and efficient performance improvements over state-of-the-art methods while significantly reducing the number of required experts.
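To make the architecture described in the abstract concrete, the sketch below shows one plausible shape of an OMoE-style layer: each expert is a low-rank (LoRA) adapter pair, a router produces gating weights, and the per-expert outputs are orthonormalized before the gated combination. This is a sketch under assumptions, not the authors' implementation: the class name `OMoELoRALayer`, the dense softmax routing, the random initialization of the LoRA up-projection, and the use of batched QR as a compact stand-in for the paper's Gram-Schmidt step (both yield an orthonormal frame, up to sign) are ours.

```python
# Hypothetical OMoE-style layer: a mixture of LoRA experts whose per-sample
# outputs are orthonormalized before being mixed by the router.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OMoELoRALayer(nn.Module):
    def __init__(self, d_in: int, d_out: int, num_experts: int = 2, rank: int = 8):
        super().__init__()
        # One low-rank (A, B) adapter pair per expert.
        self.A = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)
        # LoRA usually zero-inits B; small random init here so the toy example is non-degenerate.
        self.B = nn.Parameter(torch.randn(num_experts, rank, d_out) * 0.01)
        self.router = nn.Linear(d_in, num_experts)  # gating network

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_in) -> per-expert LoRA outputs: (batch, num_experts, d_out)
        expert_out = torch.einsum("bi,eir,ero->beo", x, self.A, self.B)
        gates = F.softmax(self.router(x), dim=-1)  # (batch, num_experts)
        # Orthonormalize each sample's expert outputs. Batched QR stands in for
        # the Gram-Schmidt pass sketched above: the expert directions become an
        # orthonormal frame, i.e. a point on the Stiefel manifold.
        q, _ = torch.linalg.qr(expert_out.transpose(-1, -2))  # (batch, d_out, num_experts)
        ortho_out = q.transpose(-1, -2)                        # (batch, num_experts, d_out)
        # Gate-weighted combination of the diversified expert directions.
        return torch.einsum("be,bed->bd", gates, ortho_out)

# Example usage on a toy batch:
layer = OMoELoRALayer(d_in=32, d_out=64, num_experts=2, rank=4)
y = layer(torch.randn(5, 32))
print(y.shape)  # torch.Size([5, 64])
```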
Problem

Research questions and friction points this paper is trying to address.

Model Adaptability
Computational Efficiency
Expert Diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Orthogonal Mixture of Experts
Low-Rank Adaptation
Stiefel Manifold
Jinyuan Feng
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Zhiqiang Pu
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Tianyi Hu
Purdue University
Dongmin Li
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Xiaolin Ai
Institute of Automation, Chinese Academy of Sciences
Huimu Wang
JD.COM