DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost of large language model (LLM) inference and the knowledge degradation inherent in conventional sparsification methods (e.g., pruning), this paper proposes Dynamic Sparse Mixture of Experts (DSMoE). Methodologically, DSMoE introduces three key innovations: (1) matrix-block reparameterization of pretrained feed-forward network (FFN) layers, achieving sparsification without removing parameters; (2) a matrix-partitioned expert structure with differentiable token routing, using sigmoid-gated selection and the straight-through estimator (STE); and (3) a differentiable sparsity loss enabling knowledge-aware, computation-adaptive sparsification, realized here for the first time in dense LLMs. Evaluated on the LLaMA architecture, DSMoE consistently outperforms both pruning-based and conventional MoE baselines on language modeling and downstream tasks under matched FLOPs, with particularly notable gains on generation tasks. Further analysis reveals layer-wise heterogeneous activation patterns.
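The sigmoid-gated expert selection with a straight-through estimator described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the function name, threshold, and four-expert toy setup are assumptions; the key idea is that the forward pass uses a hard 0/1 mask while the backward pass reuses the soft sigmoid's gradient.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ste_gate(logits, threshold=0.5):
    """Hypothetical sketch of sigmoid-gated expert selection with STE.

    Forward: hard 0/1 mask over experts (soft gate thresholded).
    Backward (straight-through): the non-differentiable threshold is
    skipped, and the hard mask inherits the soft sigmoid's gradient.
    """
    soft = sigmoid(logits)
    hard = (soft > threshold).astype(np.float64)
    grad = soft * (1.0 - soft)  # d(soft)/d(logits), reused for the hard mask
    return hard, soft, grad

# Toy example: route one token over 4 matrix-partitioned experts.
logits = np.array([2.0, -1.0, 0.3, -3.0])
hard, soft, grad = ste_gate(logits)
# Experts whose soft gate exceeds the threshold are activated;
# the others are skipped, saving their share of FFN compute.
```

In an autodiff framework the same trick is usually written as `hard + soft - soft.detach()`, so the forward value is the hard mask but gradients flow through the soft gate.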

📝 Abstract
As large language models continue to scale, computational costs and resource consumption have emerged as significant challenges. While existing sparsification methods like pruning reduce computational overhead, they risk losing model knowledge through parameter removal. This paper proposes DSMoE (Dynamic Sparse Mixture-of-Experts), a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks. We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge based on input complexity. Additionally, we introduce a sparsity loss term to balance performance and computational efficiency. Extensive experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches across language modeling and downstream tasks, particularly excelling in generation tasks. Analysis reveals that DSMoE learns distinctive layerwise activation patterns, providing new insights for future MoE architecture design.
Problem

Research questions and friction points this paper is trying to address.

Reduce computational costs in large language models
Maintain model knowledge during sparsification
Enhance efficiency in generation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Partitioned FFN layers
Dynamic expert routing
Sparsity loss term
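The sparsity loss term listed above trades performance against computation by discouraging unnecessary expert activation. One plausible differentiable form, penalizing the mean soft-gate activation, is sketched below; this is a hedged illustration and the paper's exact formulation may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sparsity_loss(gate_logits, lam=0.1):
    """Illustrative sparsity penalty (hypothetical form).

    Averages the soft sigmoid gates over all experts and scales by a
    coefficient lam; minimizing it pushes gates toward zero, so fewer
    experts fire and FLOPs drop. Added to the language-modeling loss,
    lam controls the performance/efficiency trade-off.
    """
    soft = sigmoid(np.asarray(gate_logits, dtype=np.float64))
    return lam * soft.mean()

# Gate logits for 4 experts at one layer; larger lam -> sparser routing.
loss = sparsity_loss([2.0, -1.0, 0.3, -3.0], lam=0.1)
```

Because the penalty acts on the soft gates rather than the hard mask, it stays differentiable end-to-end and can be balanced against the task loss with a single coefficient.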
Minxuan Lv
Institute of Information Engineering, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Zhenpeng Su
Institute of Information Engineering, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Kuaishou Technology
Leiyu Pan
Tianjin University
Natural Language Processing, Multilingual, Machine Translation
Yizhe Xiong
Tsinghua University
Transfer Learning, Computer Vision, Large Language Models
Zijia Lin
Tsinghua University
Information Retrieval, Computer Vision, Natural Language Processing, Machine Learning
Hui Chen
Tsinghua University
Wei Zhou
Institute of Information Engineering, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Jungong Han
Chair Professor in Computer Vision, University of Sheffield, UK, FIAPR, FAAIA
Computer Vision, Video Analytics, Machine Learning
Guiguang Ding
Tsinghua University
Computer Vision, Multimedia Retrieval
Cheng Luo
Kuaishou Technology
Di Zhang
Kuaishou Technology
Kun Gai
Senior Director & Researcher, Alibaba Group
Machine Learning, Computational Advertising
Songlin Hu
Institute of Information Engineering, Chinese Academy of Sciences; University of Chinese Academy of Sciences