🤖 AI Summary
To address the high computational cost of large language model (LLM) inference and the knowledge loss inherent in conventional sparsification methods such as pruning, this paper proposes Dynamic Sparse Mixture of Experts (DSMoE). Methodologically, DSMoE introduces three key components: (1) block-wise reparameterization of pretrained feed-forward network (FFN) layers, which partitions each FFN into experts without discarding parameters; (2) differentiable token routing over these expert blocks via sigmoid-gated selection with the straight-through estimator (STE); and (3) a differentiable sparsity loss that trades off performance against computation, yielding knowledge-aware, computation-adaptive sparsification of a dense LLM. Evaluated on the LLaMA architecture under matched FLOPs, DSMoE consistently outperforms both pruning-based and conventional MoE baselines on language modeling and downstream tasks, with particularly notable gains on generation tasks. Further analysis reveals layer-wise heterogeneous activation patterns.
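The routing mechanism described above can be sketched as follows. This is a minimal NumPy illustration under assumed shapes and names (`W_gate`, the threshold `tau`, and the ReLU two-matrix FFN form are illustrative choices, not taken from the paper), not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsmoe_forward(x, W_blocks, V_blocks, W_gate, tau=0.5):
    """Sketch of DSMoE-style routing over a block-partitioned FFN.

    x        : (d,) token hidden state
    W_blocks : list of k up-projection blocks, column-splits of the
               original (d, d_ff) FFN weight
    V_blocks : list of k matching row-splits of the (d_ff, d) weight
    W_gate   : (d, k) router weights -- hypothetical parameter name
    """
    logits = x @ W_gate                    # (k,) per-expert score
    soft = sigmoid(logits)                 # independent sigmoid gates
    hard = (soft > tau).astype(x.dtype)    # binary selection in the forward pass
    # Straight-through estimator: the forward pass uses `hard`, while the
    # backward pass would use the gradient of `soft`
    # (hard + soft - stop_grad(soft) in an autograd framework).
    out = np.zeros_like(x)
    for g, W, V in zip(hard, W_blocks, V_blocks):
        if g:                              # inactive blocks are skipped -> sparsity
            out += np.maximum(x @ W, 0.0) @ V   # ReLU FFN block (assumed form)
    return out, hard
```

Because the expert blocks are exact column/row partitions of the original FFN matrices, activating every gate reproduces the dense layer output exactly, which is the parameter-preserving property the summary highlights.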
📝 Abstract
As large language models continue to scale, computational costs and resource consumption have emerged as significant challenges. While existing sparsification methods like pruning reduce computational overhead, they risk losing model knowledge through parameter removal. This paper proposes DSMoE (Dynamic Sparse Mixture-of-Experts), a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks. We implement adaptive expert routing using sigmoid activation and the straight-through estimator, enabling tokens to flexibly access different aspects of model knowledge based on input complexity. Additionally, we introduce a sparsity loss term to balance performance and computational efficiency. Extensive experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches across language modeling and downstream tasks, particularly excelling in generation tasks. Analysis reveals that DSMoE learns distinctive layer-wise activation patterns, providing new insights for future MoE architecture design.
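The abstract does not specify the form of the sparsity loss term. One plausible differentiable form, sketched here under the assumption of independent sigmoid gates (`target_ratio` and `lam` are hypothetical hyperparameters, not from the paper), penalizes the gap between the mean gate activation and a target compute budget:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sparsity_loss(gate_logits, target_ratio=0.5, lam=1e-2):
    """Hedged sketch of a differentiable sparsity penalty.

    gate_logits : (tokens, experts) router scores before the sigmoid.
    The soft sigmoid gates are differentiable, so penalizing their mean
    activation pushes the router toward activating roughly
    `target_ratio` of the experts per token; `lam` weights the penalty
    against the language-modeling loss.
    """
    soft = sigmoid(gate_logits)
    mean_active = soft.mean()   # expected fraction of active experts
    return lam * (mean_active - target_ratio) ** 2
```

Because the penalty acts on the soft gates rather than the hard binary selections, it remains differentiable end to end, which is what lets sparsity be learned jointly with the task loss rather than fixed in advance.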