🤖 AI Summary
This work addresses the deployment challenges of sparse mixture-of-experts (SMoE) models, which are memory- and throughput-bound because all experts must be loaded. Existing pruning methods typically apply a uniform sparsity allocation across layers, overlooking the allocation's critical impact on performance. The authors decouple expert pruning into intra-layer expert ranking and inter-layer sparsity budget allocation, proposing ESAP, an efficient proxy metric inspired by speculative decoding, and integrating it into EvoESAP, an evolutionary search framework that enables non-uniform inter-layer sparsity optimization. The approach is compatible with various intra-layer pruning criteria (e.g., Frequency, EAN) and, on 7B–30B SMoE models at 25%–50% sparsity, significantly improves open-ended generation performance (up to +19.6% on MATH-500) while maintaining competitive multiple-choice accuracy.
📝 Abstract
Sparse Mixture-of-Experts (SMoE) language models achieve strong capability at low per-token compute, yet deployment remains memory- and throughput-bound because the full expert pool must be stored and served. Post-training expert pruning reduces this cost, but most methods focus on which experts to prune within each layer and default to a uniform layer-wise sparsity allocation, even though the allocation can strongly affect performance. We decouple pruning into within-layer expert ranking and across-layer budget allocation, and introduce \textbf{E}xpected \textbf{S}peculative \textbf{A}cceptance \textbf{P}roxy (\textbf{ESAP}), a speculative-decoding-inspired, teacher-forced metric that measures how well a pruned model matches the full model. ESAP is bounded and stable, enabling cheap comparison of many candidates without costly autoregressive decoding. Building on ESAP, we propose EvoESAP, an evolutionary search framework that optimizes a non-uniform layer-wise sparsity allocation under a fixed global budget while holding the within-layer pruning order fixed, making it plug-and-play with intra-layer criteria such as Frequency, EAN, SEER, and REAP. Across 7B--30B SMoE LLMs at 25\% and 50\% sparsity, EvoESAP consistently discovers non-uniform allocations that improve open-ended generation (up to \textbf{+19.6\%} on MATH-500 at 50\% sparsity) while preserving competitive multiple-choice accuracy compared with uniform pruning at the same sparsity.
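To give intuition for a speculative-acceptance-style proxy, here is a minimal sketch (not the paper's exact ESAP definition, which is not reproduced here). In speculative decoding, a token drawn from a draft distribution `q` is accepted by the target distribution `p` with probability `min(1, p(x)/q(x))`, so the expected acceptance rate at a position is `sum_x min(p(x), q(x))`, i.e., one minus the total-variation distance. Averaging this over teacher-forced positions yields a bounded score in [0, 1] that compares a pruned model against the full model without autoregressive decoding. The function name and array layout are illustrative assumptions:

```python
import numpy as np

def expected_acceptance(p_full, p_pruned):
    """Speculative-acceptance-style match score (illustrative sketch,
    not the paper's exact ESAP formula).

    p_full, p_pruned: arrays of shape (seq_len, vocab_size) holding
    per-position next-token distributions under teacher forcing, i.e.
    both models are fed the same reference prefix at every position.

    Per position, sum_x min(p_full(x), p_pruned(x)) is the expected
    probability that the full model accepts a token sampled from the
    pruned model; it equals 1 - TV(p_full, p_pruned), so the mean over
    positions is bounded in [0, 1] (1.0 means a perfect match).
    """
    per_position = np.minimum(p_full, p_pruned).sum(axis=-1)
    return float(per_position.mean())
```

A candidate pruning with a higher average acceptance keeps the pruned model's next-token behavior closer to the full model, which is what makes such a score cheap to compare across many candidates.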
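The search side can likewise be sketched as a small budget-preserving evolutionary loop. This is a generic (mu+lambda)-style illustration under assumed simplifications, not the paper's EvoESAP implementation: a candidate is a vector of per-layer expert keep counts summing to a fixed global budget, mutation moves one kept expert between layers, and `score_fn` stands in for an ESAP-like proxy evaluation. All names and hyperparameters here are hypothetical:

```python
import random

def evolve_allocation(num_layers, experts_per_layer, keep_total,
                      score_fn, pop_size=16, generations=50, seed=0):
    """Evolutionary search over per-layer keep counts (generic sketch,
    not the paper's exact EvoESAP). Each candidate is a list of how many
    experts to keep per layer, always summing to keep_total; the
    within-layer pruning order is assumed fixed elsewhere."""
    rng = random.Random(seed)

    def uniform_alloc():
        # Deterministic near-uniform start; mutation supplies diversity.
        alloc = [keep_total // num_layers] * num_layers
        for i in range(keep_total - sum(alloc)):
            alloc[i % num_layers] += 1
        return alloc

    def mutate(alloc):
        # Move one kept expert from a donor layer to a receiver layer,
        # preserving the global budget and per-layer bounds.
        child = alloc[:]
        donors = [i for i in range(num_layers) if child[i] > 1]
        receivers = [i for i in range(num_layers)
                     if child[i] < experts_per_layer]
        if donors and receivers:
            d = rng.choice(donors)
            r = rng.choice([i for i in receivers if i != d] or receivers)
            child[d] -= 1
            child[r] += 1
        return child

    pop = [uniform_alloc() for _ in range(pop_size)]
    for _ in range(generations):
        children = [mutate(rng.choice(pop)) for _ in range(pop_size)]
        # Elitist selection: keep the best candidates from parents + children.
        pop = sorted(pop + children, key=score_fn, reverse=True)[:pop_size]
    return max(pop, key=score_fn)
```

Because selection is elitist and every mutation preserves the budget, the best score is non-decreasing over generations and the returned allocation always satisfies the global sparsity constraint.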