Post-Trained MoE Can Skip Half Experts via Self-Distillation

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of converting static Mixture-of-Experts (MoE) models into dynamically sparse ones without re-pretraining, thereby reducing inference cost. The authors propose Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that adapts post-trained static MoE models into dynamic ones through self-distillation rather than pre-training from scratch. ZEDA combines parameter-free zero-output experts, a two-stage self-distillation process, a group-level load-balancing loss, and a dynamic expert-skipping mechanism. Evaluated on Qwen3-30B-A3B and GLM-4.7-Flash, ZEDA skips over 50% of expert FLOPs on average with minimal accuracy degradation, outperforming the strongest dynamic MoE baseline by 6.1 and 4.0 points, respectively, while achieving approximately 1.2× end-to-end inference speedup.
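To make the zero-output expert idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a softmax router whose last `num_zero` columns correspond to parameter-free zero-output experts, plain top-k selection, and no renormalization of routing weights; the names `route_with_zero_experts` and `moe_forward` are hypothetical. Any top-k slot that lands on a zero-output expert contributes nothing to the output, so its expert computation can be skipped.

```python
import numpy as np

def route_with_zero_experts(router_logits, num_real, top_k):
    """Top-k routing over real experts plus zero-output experts (assumed layout:
    columns [0, num_real) are real experts, the remaining columns are zero experts)."""
    z = router_logits - router_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)
    topk_idx = np.argsort(-probs, axis=-1)[:, :top_k]
    topk_w = np.take_along_axis(probs, topk_idx, axis=-1)
    is_real = topk_idx < num_real  # False marks a slot routed to a zero expert
    return topk_idx, topk_w, is_real

def moe_forward(x, router_w, experts, top_k):
    """x: (tokens, d); router_w: (d, num_real + num_zero); experts: list of callables."""
    num_real = len(experts)
    idx, w, is_real = route_with_zero_experts(x @ router_w, num_real, top_k)
    out = np.zeros_like(x, dtype=float)
    for t in range(x.shape[0]):
        for slot in range(top_k):
            if is_real[t, slot]:
                out[t] += w[t, slot] * experts[idx[t, slot]](x[t])
            # A zero-output expert adds nothing, so no expert FLOPs are spent here.
    skipped_fraction = 1.0 - is_real.mean()  # share of top-k slots that were skipped
    return out, skipped_fraction
```

In this sketch the fraction of skipped slots is input-dependent: a token whose router scores favor the zero-output experts spends fewer expert FLOPs than one whose scores favor real experts.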

📝 Abstract

Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its dynamic variant further reduces computation by adjusting the number of activated experts in an input-dependent manner. Existing dynamic MoE methods usually rely on pre-training from scratch or task-specific adaptation, leaving the practical conversion of fully trained MoE models underexplored. Enabling such adaptation would directly reduce inference costs by allowing easy tokens to bypass unnecessary experts during serving. This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones. To stabilize this architectural conversion, ZEDA injects parameter-free zero-output experts into each MoE layer and adapts the augmented model through two-stage self-distillation, using the original MoE as a frozen teacher and applying a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash, across 11 benchmarks spanning math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss. It outperforms the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two models, respectively, and delivers a ~1.20× end-to-end inference speedup.
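The self-distillation step can be illustrated with a generic token-level objective. The abstract does not specify the loss used in either stage, so the sketch below assumes a plain forward KL divergence between the frozen teacher's and the student's output distributions; the function names are hypothetical, and the group-level balancing loss is not modeled.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distill_kl(teacher_logits, student_logits):
    """Mean KL(teacher || student) over tokens.
    Assumption: the teacher is the original static MoE (frozen) and the
    student is the same model augmented with zero-output experts."""
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    kl_per_token = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1)
    return float(kl_per_token.mean())
```

Under this assumed objective the loss is zero when the student reproduces the teacher's distribution exactly and grows as routing tokens to zero-output experts shifts the student's predictions away from the teacher's.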

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

dynamic MoE

post-trained adaptation

inference efficiency

expert skipping

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts

Self-Distillation

Dynamic Sparsity