OMoE: Diversifying Mixture of Low-Rank Adaptation by Orthogonal Finetuning

📅 2025-01-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In LoRA-based Mixture-of-Experts (MoE) models, expert representation collapse and insufficient diversity lead to performance saturation. Method: This paper proposes a Stiefel-manifold-based orthogonalization training mechanism—introducing Stiefel constraints into MoE architectures for the first time—and explicitly enhances expert distinctness via Gram–Schmidt orthogonalization, requiring no additional hyperparameters and preserving the original optimization objective. The approach maintains low parameter overhead and sparse gating while improving expert representation orthogonality and inference stability. Results: Experiments on multiple commonsense reasoning benchmarks demonstrate that our method achieves state-of-the-art (SOTA) performance using only ~40% of the experts required by prior approaches, delivering higher accuracy, stronger robustness, and superior parameter efficiency in lightweight fine-tuning.
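The core mechanism described in the summary, orthogonalizing the experts' representations with a Gram-Schmidt pass so that the stacked vectors form an orthonormal frame (a point on the Stiefel manifold), can be illustrated with a minimal PyTorch sketch. This is an illustration only, not the authors' code; the function name `gram_schmidt`, the tensor shapes, and the epsilon guard are assumptions.

```python
# Minimal sketch (not the authors' implementation) of the Gram-Schmidt step:
# orthonormalize a stack of expert representations so they lie on the Stiefel manifold.
import torch

def gram_schmidt(expert_reps: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Orthonormalize expert representations.

    expert_reps: tensor of shape (num_experts, dim), one representation per expert.
    Returns a tensor of the same shape whose rows are mutually orthogonal and unit-norm.
    """
    ortho = []
    for v in expert_reps:
        u = v.clone()
        # Subtract the projection onto every previously accepted direction.
        for q in ortho:
            u = u - (u @ q) * q
        # Normalize; eps guards against near-zero (collapsed) vectors.
        u = u / (u.norm() + eps)
        ortho.append(u)
    return torch.stack(ortho, dim=0)

# Example: four experts with 16-dimensional representations.
reps = torch.randn(4, 16)
ortho_reps = gram_schmidt(reps)
print(torch.allclose(ortho_reps @ ortho_reps.T, torch.eye(4), atol=1e-5))  # True
```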

📝 Abstract
Building a mixture-of-experts (MoE) architecture for low-rank adaptation (LoRA) is emerging as a potential direction in parameter-efficient fine-tuning (PEFT) for its modular design and remarkable performance. However, simply stacking the number of experts cannot guarantee significant improvement. In this work, we first conduct qualitative analysis to indicate that experts collapse to similar representations in vanilla MoE, limiting the capacity of modular design and computational efficiency. Further, our analysis reveals that the performance of previous MoE variants may be limited by a lack of diversity among experts. Motivated by these findings, we propose Orthogonal Mixture-of-Experts (OMoE), a resource-efficient MoE variant that trains experts in an orthogonal manner to promote diversity. In OMoE, a Gram-Schmidt process is leveraged to enforce that the experts' representations lie within the Stiefel manifold. By applying orthogonal constraints directly to the architecture, OMoE keeps the learning objective unchanged, without compromising optimality. Our method is simple and alleviates memory bottlenecks, as it requires minimal experts compared with vanilla MoE models. Experiments on diverse commonsense reasoning benchmarks demonstrate that OMoE consistently achieves stable and efficient performance improvements over state-of-the-art methods while significantly reducing the number of required experts.
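To make the architecture described in the abstract concrete, the sketch below shows one plausible shape of an OMoE-style layer: each expert is a low-rank (LoRA) adapter pair, a router produces gating weights, and the per-expert outputs are orthonormalized before the gated combination. This is a sketch under assumptions, not the authors' implementation: the class name `OMoELoRALayer`, the dense softmax routing, the random initialization of the LoRA up-projection, and the use of batched QR as a compact stand-in for the paper's Gram-Schmidt step (both yield an orthonormal frame, up to sign) are ours.

```python
# Hypothetical OMoE-style layer: a mixture of LoRA experts whose per-sample
# outputs are orthonormalized before being mixed by the router.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OMoELoRALayer(nn.Module):
    def __init__(self, d_in: int, d_out: int, num_experts: int = 2, rank: int = 8):
        super().__init__()
        # One low-rank (A, B) adapter pair per expert.
        self.A = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)
        # LoRA usually zero-inits B; small random init here so the toy example is non-degenerate.
        self.B = nn.Parameter(torch.randn(num_experts, rank, d_out) * 0.01)
        self.router = nn.Linear(d_in, num_experts)  # gating network

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_in) -> per-expert LoRA outputs: (batch, num_experts, d_out)
        expert_out = torch.einsum("bi,eir,ero->beo", x, self.A, self.B)
        gates = F.softmax(self.router(x), dim=-1)  # (batch, num_experts)
        # Orthonormalize each sample's expert outputs. Batched QR stands in for
        # the Gram-Schmidt pass sketched above: the expert directions become an
        # orthonormal frame, i.e. a point on the Stiefel manifold.
        q, _ = torch.linalg.qr(expert_out.transpose(-1, -2))  # (batch, d_out, num_experts)
        ortho_out = q.transpose(-1, -2)                        # (batch, num_experts, d_out)
        # Gate-weighted combination of the diversified expert directions.
        return torch.einsum("be,bed->bd", gates, ortho_out)

# Example usage on a toy batch:
layer = OMoELoRALayer(d_in=32, d_out=64, num_experts=2, rank=4)
y = layer(torch.randn(5, 32))
print(y.shape)  # torch.Size([5, 64])
```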
Problem

Research questions and friction points this paper is trying to address.

Model Adaptability
Computational Efficiency
Expert Diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Orthogonal Mixture of Experts
Low-Rank Adaptation
Stiefel Manifold
Jinyuan Feng
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Zhiqiang Pu
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Tianyi Hu
Purdue University
Dongmin Li
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Xiaolin Ai
Institute of Automation, Chinese Academy of Sciences
Huimu Wang
JD.COM