Improving MoE Compute Efficiency by Composing Weight and Data Sparsity

📅 2026-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a trade-off in sparsity for autoregressive Mixture-of-Experts (MoE) models: expert-choice routing introduces data sparsity but violates causality, creating a mismatch between training and inference, while causal token-choice MoE provides only weight sparsity and leaves a complementary axis of compute efficiency unused. To resolve this, the authors add zero-computation (null) experts to the routing pool of a causally compliant token-choice MoE, recovering data sparsity without compromising causal integrity. Under the standard load-balancing objective, tokens are distributed uniformly across real and null experts in expectation, so modality-aware computation allocation emerges implicitly rather than through explicit routing rules. Experiments in vision–language pretraining demonstrate that, at identical expected FLOPs, this approach significantly reduces training loss, improves downstream performance, and automatically routes visual tokens to the null experts more aggressively than text tokens.

📝 Abstract
Mixture-of-Experts layers achieve compute efficiency through weight sparsity: each token activates only a subset of experts. Data sparsity, where each expert processes only a subset of tokens, offers a complementary axis. Expert-choice routing implements data sparsity directly but violates causality in autoregressive models, creating train-inference mismatch. We recover data sparsity within causal token-choice MoE by including zero-compute (null) experts in the routing pool. When a token routes to null experts, those slots consume no compute. The standard load-balancing objective trains the model to use all experts (real and null) uniformly, thereby creating data sparsity in expectation without causality violations. We evaluate on vision-language model training, where data heterogeneity is pronounced: vision encoders produce many low-information tokens while text tokens are denser. At matched expected FLOPs, composing weight and data sparsity yields a more compute-efficient frontier than weight sparsity alone, with gains in training loss and downstream performance. The model learns implicit modality-aware allocation, routing vision tokens to null experts more aggressively than text, without explicit modality routing.
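The mechanism in the abstract can be sketched in a few lines: the router scores real and null experts together, null slots simply contribute nothing and skip expert compute, and the usual load-balancing auxiliary loss spreads tokens over the full pool so a predictable fraction of slots is free. This is a minimal illustrative sketch; the function names, shapes, top-k routing, and the exact form of the auxiliary loss are assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, router_w, experts, n_null, top_k=2):
    """x: (T, D). experts: list of callables (real experts only).
    The router scores len(experts) + n_null slots; slots that land on a
    null expert produce zero output and consume no expert compute."""
    n_real = len(experts)
    n_exp = n_real + n_null
    gates = softmax(x @ router_w)                    # (T, n_exp), per-token (causal)
    topk_i = np.argsort(-gates, axis=-1)[:, :top_k]  # (T, k) chosen experts
    topk_g = np.take_along_axis(gates, topk_i, axis=-1)

    out = np.zeros_like(x)
    for e in range(n_real):                          # null slots (e >= n_real) are skipped
        for slot in range(top_k):
            mask = topk_i[:, slot] == e
            if mask.any():
                out[mask] += topk_g[mask, slot][:, None] * experts[e](x[mask])

    # Standard load-balancing auxiliary loss over ALL n_exp experts:
    # balancing pushes ~top_k/n_exp of slots to each expert, so in
    # expectation a fraction n_null/n_exp of slots cost nothing
    # (data sparsity), with no causality violation.
    frac = np.eye(n_exp)[topk_i].sum(1).mean(0)      # routed load per expert
    prob = gates.mean(0)                              # mean gate per expert
    aux_loss = n_exp * (frac * prob).sum()
    return out, aux_loss
```

Because the routing decision is still per-token and depends only on that token's representation, the scheme stays causal; data sparsity arises only in expectation, via the balancing pressure toward the null slots.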
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
compute efficiency
data sparsity
causality
weight sparsity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
data sparsity
null experts
causal routing
compute efficiency