T-REX: Mixture-of-Rank-One-Experts with Semantic-aware Intuition for Multi-task Large Language Model Finetuning

📅 2024-04-13
📈 Citations: 6
Influential: 0
📄 PDF
🤖 AI Summary
To address the substantial parameter and computational overhead, as well as the inefficient routing, induced by expert proliferation in Mixture-of-Experts (MoE) architectures during multi-task fine-tuning, this paper proposes T-REX (mixTure-of-Rank-onE-eXperts), a dynamic MoE architecture. Methodologically, it integrates LoRA, low-rank decomposition, and clustering-based modeling. Its key contributions are: (1) a rank-1 expert composition mechanism that achieves quadratic subspace expansion with only linear parameter growth; and (2) a router guided by the semantic clustering of training embeddings as prior knowledge, improving semantic awareness and training stability. Evaluated on 14 public benchmarks, T-REX improves mean accuracy by up to 1.78% while using roughly 30–40% fewer trainable parameters, outperforming existing LoRA-based approaches.

📝 Abstract
Large language models (LLMs) encounter significant adaptation challenges in diverse multitask finetuning. Mixture-of-experts (MoE) provides a promising solution with a dynamic architecture, enabling effective task decoupling. However, scaling up the number of MoE experts incurs substantial parameter and computational overheads and suffers from limited performance gain due to naive routing mechanisms. In this paper, we design a novel framework, mixTure-of-Rank-onE-eXperts (T-REX), which leverages the combination of ultra-low-rank experts to construct LoRA weights on pretrained LLMs. The rank-1 experts enable a mix-and-match mechanism to quadratically expand the vector subspace of experts with linear parameter overheads, achieving approximate error reduction with optimal efficiency. In addition, T-REX offers implicit guidance to the router, leveraging the inherent semantic clustering of training embeddings as prior knowledge, enabling optimized feature allocation across experts for smoother convergence. Extensive theoretical and empirical results demonstrate that T-REX achieves superior efficiency and generalizability across diverse tasks. Compared with other LoRA-based methods, T-REX achieves up to 1.78% mean accuracy improvement with around 30%–40% fewer trainable parameters across 14 public datasets. Code is available at https://github.com/RoyZry98/T-REX-Pytorch.
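The mix-and-match idea in the abstract — N rank-1 experts whose gated sum forms the LoRA update, so the spanned subspace grows with the number of expert combinations while parameters grow only linearly in N — can be sketched as follows. This is a minimal illustrative implementation, not the paper's released code; the class name, router design, and initialization are assumptions.

```python
import torch

class RankOneMoE(torch.nn.Module):
    """Hypothetical sketch of a mixture-of-rank-one-experts LoRA update.

    Each expert i is a rank-1 outer product u_i v_i^T. A softmax router
    gates the experts per input, so the effective update
    sum_i g_i(x) * u_i v_i^T mixes N rank-1 directions with only
    O(N * (d_in + d_out)) trainable parameters.
    """

    def __init__(self, d_in, d_out, n_experts, alpha=1.0):
        super().__init__()
        # u initialized to zero so the adapter starts as an identity
        # (standard LoRA-style init); v gets small random values.
        self.u = torch.nn.Parameter(torch.zeros(n_experts, d_out))
        self.v = torch.nn.Parameter(torch.randn(n_experts, d_in) * 0.01)
        self.router = torch.nn.Linear(d_in, n_experts)
        self.alpha = alpha

    def forward(self, x, base_out):
        # x: (batch, d_in); base_out: frozen pretrained layer output, (batch, d_out)
        gates = torch.softmax(self.router(x), dim=-1)  # (batch, n_experts)
        proj = x @ self.v.T                            # (batch, n_experts): v_i . x
        delta = (gates * proj) @ self.u                # (batch, d_out): sum_i g_i (v_i.x) u_i
        return base_out + self.alpha * delta
```

Because `u` starts at zero, the adapter initially passes `base_out` through unchanged, and training only has to learn the low-rank correction.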
Problem

Research questions and friction points this paper is trying to address.

Addresses adaptation challenges in multitask LLM finetuning
Reduces parameter overhead in mixture-of-experts scaling
Improves routing via semantic-aware expert allocation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Rank-One-Experts for efficient task decoupling
Ultra-low rank experts to construct LoRA weights
Semantic-aware routing for optimized feature allocation
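The semantic-aware routing bullet above (and the abstract's "inherent semantic clustering of training embeddings as prior knowledge") can be illustrated with a simple k-means pass over embeddings whose cluster assignments serve as a prior for expert allocation. This is a hedged sketch of the general idea only; the function name and the exact way the paper injects the prior into the router are assumptions.

```python
import torch

def cluster_prior_targets(embeddings, n_experts, n_iters=10):
    """Hypothetical semantic-clustering prior for a MoE router.

    Runs plain k-means over training embeddings and returns per-sample
    cluster assignments, which could supervise the router (e.g. via an
    auxiliary cross-entropy loss) so that semantically similar inputs
    are steered toward the same expert.
    """
    # Initialize centroids from randomly chosen samples.
    idx = torch.randperm(embeddings.size(0))[:n_experts]
    centroids = embeddings[idx].clone()
    for _ in range(n_iters):
        # Assign each embedding to its nearest centroid.
        dists = torch.cdist(embeddings, centroids)  # (N, n_experts)
        assign = dists.argmin(dim=-1)               # (N,)
        # Recompute centroids as cluster means, skipping empty clusters.
        for k in range(n_experts):
            mask = assign == k
            if mask.any():
                centroids[k] = embeddings[mask].mean(dim=0)
    return assign, centroids
```

The returned assignments act as soft routing targets: instead of letting a randomly initialized router discover the task structure from scratch, the clustering prior gives it a semantically meaningful starting allocation, which the abstract credits for smoother convergence.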
Rongyu Zhang
Nanjing University; National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Yijiang Liu
PhD; Machine Learning Efficiency
Huanrui Yang
Assistant Professor, ECE, University of Arizona; Efficient deep learning, Trustworthy deep learning
Shenli Zheng
Dan Wang
Yuan Du
Nanjing University
Li Du
Nanjing University
Shanghang Zhang
Peking University; Embodied AI, Foundation Models