AIMER: Calibration-Free Task-Agnostic MoE Pruning

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a key limitation of existing task-agnostic expert pruning methods for Mixture-of-Experts (MoE) models: they rely on calibration sets to estimate expert importance, which makes pruning results sensitive to the choice of calibration data and adds substantial preprocessing overhead. To overcome this, the authors propose a calibration-free scoring criterion that ranks experts within each layer by the ratio of the absolute mean to the root mean square (RMS) of their weights. This yields efficient and stable intra-layer expert differentiation and supports hierarchical pruning. The method scales across MoE models from 7B to 30B parameters, with scoring times of only 0.22–1.27 seconds. Evaluated on 16 benchmarks at pruning rates of 25%–50%, it achieves competitive or superior performance compared to calibration-dependent approaches, improving both pruning efficiency and generalization.

📝 Abstract
Mixture-of-Experts (MoE) language models increase parameter capacity without proportional per-token compute, but deployment still requires storing all experts, making expert pruning important for reducing memory and serving overhead. Existing task-agnostic expert pruning methods are typically calibration-dependent: they estimate expert importance from routing or activation statistics on a calibration set, which makes pruning outcomes sensitive to the choice of calibration set and adds substantial preprocessing cost. We introduce AIMER (**A**bsolute mean over root mean square **IM**portance for **E**xpert **R**anking), a simple calibration-free criterion that yields clear within-layer score separation and distinct expert stratification. Across 7B to 30B MoE language models at 25% and 50% pruning ratios over 16 benchmarks, AIMER consistently delivers competitive or stronger overall performance against state-of-the-art calibration-based expert pruning baselines with only 0.22–1.27 seconds for scoring the experts.
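The abstract's criterion can be sketched in a few lines: score each expert by the ratio of the absolute mean to the RMS of its parameters, then rank experts within each layer. This is a minimal illustration, not the paper's implementation; which weight tensors are scored, how they are flattened, and whether high-scoring or low-scoring experts are retained are all assumptions here.

```python
import numpy as np

def aimer_score(expert_weights: np.ndarray) -> float:
    """Ratio of absolute mean to root mean square (RMS) over an expert's weights.

    Assumption: we flatten all of the expert's parameters into one vector;
    the paper may score per-matrix or aggregate differently.
    """
    w = expert_weights.ravel()
    rms = np.sqrt(np.mean(w ** 2))
    return abs(w.mean()) / (rms + 1e-12)  # epsilon guards against all-zero experts

def rank_experts_in_layer(layer_experts):
    """Return expert indices ordered by descending AIMER score within one layer.

    Assumption: lower-ranked (tail) experts are the pruning candidates.
    """
    scores = [aimer_score(w) for w in layer_experts]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy usage: a constant-weight expert has |mean|/RMS = 1, while a zero-mean
# alternating expert scores near 0, so the constant expert ranks first.
expert_a = np.ones((4, 4))
expert_b = np.tile(np.array([1.0, -1.0]), 8).reshape(4, 4)
order = rank_experts_in_layer([expert_b, expert_a])
```

Because the score depends only on the weights, no forward passes over calibration data are needed, which is what keeps scoring in the sub-second range the abstract reports.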
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
expert pruning
calibration-free
task-agnostic
model compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

calibration-free
Mixture-of-Experts
expert pruning
task-agnostic
efficient inference