ExpertWeaver: Unlocking the Inherent MoE in Dense LLMs with GLU Activation Patterns

📅 2026-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes ExpertWeaver, a framework that addresses a key shortcoming of existing methods for converting dense large language models into sparse Mixture-of-Experts (MoE) architectures: they disrupt the intrinsic activation structures of the dense model. ExpertWeaver is the first to uncover a coarse-grained MoE structure implicitly embedded in the Gated Linear Unit (GLU) mechanism. By analyzing fine-grained neuron activation patterns, it automatically partitions neurons into general-purpose and task-specific subsets to construct shared and routed experts. The method requires no additional training and integrates neuron clustering, layer-adaptive expert partitioning, and dynamic structural pruning. Empirical results show that ExpertWeaver significantly outperforms current MoE initialization and pruning approaches, achieving higher inference efficiency while preserving model performance.

📝 Abstract
Mixture-of-Experts (MoE) effectively scales model capacity while preserving computational efficiency through sparse expert activation. However, training high-quality MoEs from scratch is prohibitively expensive. A promising alternative is to convert pretrained dense models into sparse MoEs. Existing dense-to-MoE methods fall into two categories: **dynamic structural pruning**, which converts dense models into MoE architectures with moderate sparsity to balance performance and inference efficiency, and **downcycling** approaches, which use pretrained dense models to initialize highly sparse MoE architectures. However, existing methods break the intrinsic activation patterns within dense models, leading to suboptimal expert construction. In this work, we argue that the Gated Linear Unit (GLU) mechanism provides a natural blueprint for dense-to-MoE conversion. We show that the fine-grained neuron-wise activation patterns of GLU reveal a coarse-grained structure, uncovering an inherent MoE architecture composed of consistently activated universal neurons and dynamically activated specialized neurons. Leveraging this discovery, we introduce ExpertWeaver, a training-free framework that partitions neurons according to their activation patterns and constructs shared experts and specialized routed experts with layer-adaptive configurations. Our experiments demonstrate that ExpertWeaver significantly outperforms existing methods, both as a training-free dynamic structural pruning technique and as a downcycling strategy for superior MoE initialization.
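The core idea — measuring per-neuron GLU gate activation frequency on calibration data, assigning consistently active neurons to a shared expert, and grouping the rest into routed experts — can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the paper's implementation: the SiLU gate, the 0.1 activation threshold, the 0.9 frequency cutoff, and the frequency-sorted grouping are all hypothetical placeholders for the paper's clustering and layer-adaptive partitioning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for one GLU feed-forward layer (hypothetical sizes).
d_model, d_ff, n_tokens = 16, 64, 512
W_gate = rng.normal(size=(d_model, d_ff))
X = rng.normal(size=(n_tokens, d_model))   # calibration-token activations


def silu(z):
    """SiLU gate nonlinearity, as used in common GLU variants."""
    return z / (1.0 + np.exp(-z))


# Per-neuron activation pattern: on what fraction of calibration tokens
# is the gate output significantly non-zero?
gate = silu(X @ W_gate)                    # (n_tokens, d_ff)
active = np.abs(gate) > 0.1                # hypothetical activation threshold
freq = active.mean(axis=0)                 # activation frequency per neuron

# Consistently active ("universal") neurons form the shared expert;
# the remaining ("specialized") neurons are assigned to routed experts.
shared = np.where(freq > 0.9)[0]           # hypothetical frequency cutoff
routed = np.where(freq <= 0.9)[0]

# Naive stand-in for the paper's neuron-clustering step: split routed
# neurons into a few experts, here simply by activation frequency.
n_experts = 4
experts = np.array_split(routed[np.argsort(freq[routed])], n_experts)
```

In this sketch every neuron ends up in exactly one expert, so the shared expert plus the routed experts together cover the full FFN width; a real conversion would also have to carry the corresponding rows/columns of the up- and down-projection matrices along with each partition.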
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
dense-to-MoE conversion
activation patterns
Gated Linear Unit
expert construction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
Gated Linear Unit
dense-to-MoE conversion
activation patterns
training-free