🤖 AI Summary
To address task interference and catastrophic forgetting in multi-task adaptation of large language models, this paper proposes a Singular Value Decomposition (SVD)-based Mixture-of-Experts (MoE) method. The core innovation is an orthogonal rank-one expert architecture: each weight matrix is decomposed via SVD, and its left and right singular vectors—which are orthogonal by construction—are used to build rank-one expert bases that strictly preserve the original weight’s column space; a task-aware router dynamically modulates the singular value associated with each expert. This design simultaneously enforces expert orthogonality, suppresses cross-task interference, and retains previously learned knowledge. Experiments on multiple multi-task benchmarks demonstrate that the method significantly outperforms LoRA and its MoE variants, achieving superior robustness against task conflict and stronger resistance to catastrophic forgetting.
📝 Abstract
Adapting large-scale foundation models in multi-task scenarios often suffers from task conflict and oblivion. To mitigate such issues, we propose a novel "model MoE-ization" strategy that leads to a conflict- and oblivion-resistant multi-task adaptation method. Given a weight matrix of a pre-trained model, our method applies SVD to it and introduces a learnable router to adjust its singular values based on tasks and samples. Accordingly, the weight matrix becomes a Mixture of Orthogonal Rank-one Experts (MoORE), in which each expert corresponds to the outer product of a left singular vector and the corresponding right one. We can further improve the model capacity by imposing a learnable orthogonal transform on the right singular vectors. Unlike low-rank adaptation (LoRA) and its MoE-driven variants, MoORE guarantees the experts' orthogonality and maintains the column space of the original weight matrix. These two properties make the adapted model resistant to conflicts among the new tasks and to the oblivion of its original tasks, respectively. Experiments on various datasets demonstrate that MoORE consistently outperforms existing multi-task adaptation methods, showing its superiority in terms of conflict- and oblivion-resistance. The code of the experiments is available at https://github.com/DaShenZi721/MoORE.
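The mechanism described above can be sketched in a few lines of NumPy: the pre-trained weight is decomposed once via SVD, and a router rescales the singular values per input sample, so each rank-one expert \(u_i v_i^\top\) is gated independently. This is a minimal illustration, not the paper's implementation — the softmax router `R` below is a hypothetical stand-in for the task-aware router, and the learnable orthogonal transform on the right singular vectors is omitted; see the linked repository for the actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 6, 4
r = min(d_out, d_in)  # number of rank-one experts

# Pre-trained weight and its SVD: W = U @ diag(s) @ Vt
W = rng.standard_normal((d_out, d_in))
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Hypothetical sample-conditioned router producing one gate per expert
R = rng.standard_normal((r, d_in))
def router(x):
    z = R @ x
    z = z - z.max()                       # numerically stable softmax
    return np.exp(z) / np.exp(z).sum()

def moore_forward(x):
    # Each expert is the outer product u_i v_i^T; the router rescales
    # the corresponding singular value for this particular sample.
    g = router(x)
    return U @ ((g * s) * (Vt @ x))

x = rng.standard_normal(d_in)
y = moore_forward(x)

# With all gates fixed to 1, the layer reproduces the original weight,
# i.e., the pre-trained column space (spanned by U) is preserved.
assert np.allclose(U @ (s * (Vt @ x)), W @ x)
```

Because the output always lies in the span of `U`, adaptation only reweights directions the pre-trained model already uses, which is the source of the oblivion-resistance claimed in the abstract.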