🤖 AI Summary
To address the high computational cost of large language model (LLM) inference and the knowledge loss inherent in conventional sparsification methods such as pruning, this paper proposes Dynamic Sparse Mixture of Experts (DSMoE). Methodologically, DSMoE introduces three key components: (1) block-wise reparameterization of pretrained feed-forward network (FFN) layers, which partitions each FFN into experts without discarding parameters; (2) differentiable token routing over these expert blocks via sigmoid-gated selection with the straight-through estimator (STE); and (3) a differentiable sparsity loss that trades off performance against computation, yielding knowledge-aware, computation-adaptive sparsification of a dense LLM. Evaluated on the LLaMA architecture under matched FLOPs, DSMoE consistently outperforms both pruning-based and conventional MoE baselines on language modeling and downstream tasks, with particularly notable gains on generation tasks. Further analysis reveals layer-wise heterogeneous activation patterns.
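The routing mechanism described above can be sketched as follows. This is a minimal NumPy illustration under assumed shapes and names (`W_gate`, the threshold `tau`, and the ReLU two-matrix FFN form are illustrative choices, not taken from the paper), not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsmoe_forward(x, W_blocks, V_blocks, W_gate, tau=0.5):
    """Sketch of DSMoE-style routing over a block-partitioned FFN.

    x        : (d,) token hidden state
    W_blocks : list of k up-projection blocks, column-splits of the
               original (d, d_ff) FFN weight
    V_blocks : list of k matching row-splits of the (d_ff, d) weight
    W_gate   : (d, k) router weights -- hypothetical parameter name
    """
    logits = x @ W_gate                    # (k,) per-expert score
    soft = sigmoid(logits)                 # independent sigmoid gates
    hard = (soft > tau).astype(x.dtype)    # binary selection in the forward pass
    # Straight-through estimator: the forward pass uses `hard`, while the
    # backward pass would use the gradient of `soft`
    # (hard + soft - stop_grad(soft) in an autograd framework).
    out = np.zeros_like(x)
    for g, W, V in zip(hard, W_blocks, V_blocks):
        if g:                              # inactive blocks are skipped -> sparsity
            out += np.maximum(x @ W, 0.0) @ V   # ReLU FFN block (assumed form)
    return out, hard
```

Because the expert blocks are exact column/row partitions of the original FFN matrices, activating every gate reproduces the dense layer output exactly, which is the parameter-preserving property the summary highlights.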
📝 Abstract
As large language models continue to scale, computational costs and resource consumption have emerged as significant challenges. While existing sparsification methods like pruning reduce computational overhead, they risk losing model knowledge through parameter removal. This paper proposes DSMoE (Dynamic Sparse Mixture-of-Experts), a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks. We implement adaptive expert routing using sigmoid activation and the straight-through estimator, enabling tokens to flexibly access different aspects of model knowledge based on input complexity. Additionally, we introduce a sparsity loss term to balance performance and computational efficiency. Extensive experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches across language modeling and downstream tasks, particularly excelling in generation tasks. Analysis reveals that DSMoE learns distinctive layer-wise activation patterns, providing new insights for future MoE architecture design.
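The abstract does not specify the form of the sparsity loss term. One plausible differentiable form, sketched here under the assumption of independent sigmoid gates (`target_ratio` and `lam` are hypothetical hyperparameters, not from the paper), penalizes the gap between the mean gate activation and a target compute budget:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sparsity_loss(gate_logits, target_ratio=0.5, lam=1e-2):
    """Hedged sketch of a differentiable sparsity penalty.

    gate_logits : (tokens, experts) router scores before the sigmoid.
    The soft sigmoid gates are differentiable, so penalizing their mean
    activation pushes the router toward activating roughly
    `target_ratio` of the experts per token; `lam` weights the penalty
    against the language-modeling loss.
    """
    soft = sigmoid(gate_logits)
    mean_active = soft.mean()   # expected fraction of active experts
    return lam * (mean_active - target_ratio) ** 2
```

Because the penalty acts on the soft gates rather than the hard binary selections, it remains differentiable end to end, which is what lets sparsity be learned jointly with the task loss rather than fixed in advance.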