ToMoE: Converting Dense Large Language Models to Mixture-of-Experts through Dynamic Structural Pruning

📅 2025-01-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational and memory overhead of large language model (LLM) deployment, this paper proposes a fine-tuning-free, differentiable dynamic structural pruning method. Specifically, it reformulates the MLP layers of dense LLMs into sparsely activated Mixture-of-Experts (MoE) architectures, dynamically controlling the number of activated parameters at each inference step without permanently removing any parameters. The key contribution is gradient-driven control of expert activation, enabling end-to-end differentiable optimization and zero-shot use without fine-tuning. Evaluated on diverse model families (Phi-2, LLaMA-2/3, and Qwen-2.5), the method substantially outperforms existing structural pruning approaches, achieving meaningful inference speedup while preserving near-original accuracy and generalizing well across architectures and tasks.

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable abilities in tackling a wide range of complex tasks. However, their huge computational and memory costs raise significant challenges for deploying these models on resource-constrained devices or serving them efficiently. Prior approaches have attempted to alleviate these problems by permanently removing less important model structures, yet such methods often result in substantial performance degradation because the deleted parameters cannot be recovered. In this work, we mitigate this issue by reducing the number of active parameters without permanently removing them. Specifically, we introduce a differentiable dynamic pruning method that pushes dense models to maintain a fixed number of active parameters by converting their MLP layers into a Mixture-of-Experts (MoE) architecture. Our method, even without fine-tuning, consistently outperforms previous structural pruning techniques across diverse model families, including Phi-2, LLaMA-2, LLaMA-3, and Qwen-2.5.
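The core idea in the abstract, partitioning a dense MLP's hidden units into expert groups and routing each token to a fixed top-k subset so the active parameter count is constant while no weights are deleted, can be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: the dimensions, the linear router, ReLU activation, and all variable names are assumptions chosen for clarity.

```python
# Hypothetical sketch: a dense MLP (up-projection W1, down-projection W2)
# whose hidden dimension is split into E expert slices; a router activates
# only the top-K slices per token. No parameters are removed, but only a
# fixed fraction is active on any given forward pass.
import math
import random

random.seed(0)

D, H, E = 4, 8, 4          # model dim, MLP hidden dim, number of experts
K = 2                      # experts activated per token (fixed budget)
S = H // E                 # hidden units owned by each expert

# Dense MLP weights: W1 is D x H, W2 is H x D
W1 = [[random.gauss(0, 0.5) for _ in range(H)] for _ in range(D)]
W2 = [[random.gauss(0, 0.5) for _ in range(D)] for _ in range(H)]
# Router: one score per expert, computed from the token representation
Wr = [[random.gauss(0, 0.5) for _ in range(E)] for _ in range(D)]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def moe_mlp(x):
    """Forward pass that activates only the top-K expert slices of the MLP."""
    scores = softmax([sum(x[d] * Wr[d][e] for d in range(D)) for e in range(E)])
    topk = sorted(range(E), key=lambda e: -scores[e])[:K]
    y = [0.0] * D
    for e in topk:
        for h in range(e * S, (e + 1) * S):       # only this expert's hidden units
            a = max(0.0, sum(x[d] * W1[d][h] for d in range(D)))   # ReLU
            for d in range(D):
                y[d] += scores[e] * a * W2[h][d]  # gate-weighted down-projection
    return y, topk

out, chosen = moe_mlp([0.3, -0.1, 0.7, 0.2])
print(len(chosen))  # prints 2: exactly K experts are active for any input
```

Because the gate scores multiply the expert outputs, gradients flow through the router, which is what makes the activation pattern trainable end to end; a production version would use a differentiable top-k relaxation rather than the hard `sorted(...)[:K]` selection shown here.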
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Resource Constraints
Performance Degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

ToMoE
Parameter Efficiency
Task-adaptive