Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models

📅 2025-01-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates the trade-off between sparsity, performance, and efficiency in sparse Mixture-of-Experts (MoE) language models. We systematically sweep sparsity levels across multiple model scales and compute budgets, employing controlled ablation experiments and multi-stage evaluation covering pretraining loss, zero-shot generalization, and fine-tuning performance. Our key finding is the existence of a hardware- and scale-dependent optimal sparsity level, a phenomenon we formally integrate into the scaling-law framework. Crucially, we show that sparsity exerts a non-monotonic effect on model behavior, and we propose principled guidelines for selecting sparsity under given resource constraints. Empirically, adopting the optimal sparsity yields 1.3–1.8× higher training throughput and improves average downstream task accuracy by 2.1–4.7 percentage points under a fixed parameter count or compute budget. These results provide both theoretical foundations and practical prescriptions for designing efficient large-scale MoE models.

📝 Abstract
Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be primarily defined by two dimensions: the number of model parameters and the compute per example. While scaling typically involves increasing both, the precise interplay between these factors and their combined contribution to overall capacity remains not fully understood. We explore this relationship in the context of sparse Mixture-of-Expert models (MoEs), which allow scaling the number of parameters without proportionally increasing the FLOPs per example. We investigate how varying the sparsity level, i.e., the ratio of non-active to total parameters, affects model performance in terms of both pretraining and downstream performance. We find that under different constraints (e.g. parameter size and total training compute), there is an optimal level of sparsity that improves both training efficiency and model performance. These results provide a better understanding of the impact of sparsity in scaling laws for MoEs and complement existing works in this area, offering insights for designing more efficient architectures.
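The abstract defines sparsity as the ratio of non-active to total parameters. To make this concrete, here is a minimal sketch of how that ratio (and the resulting compute per example) can be computed for a top-k MoE; the function names, parameter values, and the "~2 FLOPs per active parameter" rule of thumb are illustrative assumptions, not taken from the paper.

```python
def moe_sparsity(total_experts: int, active_experts: int,
                 expert_params: int, shared_params: int) -> float:
    """Sparsity = non-active parameters / total parameters.

    shared_params covers weights used for every token (attention,
    embeddings); expert weights are only active for routed tokens.
    """
    total = shared_params + total_experts * expert_params
    active = shared_params + active_experts * expert_params
    return 1.0 - active / total

def approx_flops_per_token(active_params: int) -> int:
    """Rule-of-thumb forward-pass cost: ~2 FLOPs per active parameter."""
    return 2 * active_params

# Hypothetical example: 64 experts, top-2 routing,
# 10M parameters per expert, 100M shared parameters.
s = moe_sparsity(total_experts=64, active_experts=2,
                 expert_params=10_000_000, shared_params=100_000_000)
print(round(s, 3))  # sparsity close to 1 means few parameters active per token
```

Note how adding experts grows total parameters (capacity) while leaving `approx_flops_per_token` unchanged, which is exactly the parameters-vs-FLOPs decoupling the paper studies.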
Problem

Research questions and friction points this paper is trying to address.

Sparse Mixture of Experts (MoE)
Language Model Performance
Training Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Mixture of Experts (MoE)
Optimal Sparsity Level
Training Efficiency and Performance