Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

📅 2025-08-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the differential impact of sparsity on memorization versus reasoning in Mixture-of-Experts (MoE) language models, addressing the counterintuitive finding that reasoning performance saturates or even degrades as parameter count grows. Under a fixed computational budget, we systematically vary total parameters, activated parameters, and top-k routing sparsity to disentangle their contributions to pretraining loss and downstream task performance. Results show that memorization improves consistently with scale, whereas reasoning deteriorates markedly under high sparsity; these deficits are not remedied by reinforcement-learning fine-tuning or test-time compute scaling. Critically, the activated parameter count proves more decisive for reasoning than the total parameter count. We propose a controlled-variable training framework for attributing capability changes to individual MoE design choices, accompanied by open-sourced code and comprehensive logs. This work provides both theoretical insights and practical guidelines for MoE architecture design.

📝 Abstract
Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization and reasoning. We train families of MoE Transformers that systematically vary total parameters, active parameters, and top-$k$ routing while holding the compute budget fixed. For every model we record pre-training loss, downstream task loss, and task accuracy, allowing us to separate the train-test generalization gap from the loss-accuracy gap. Memorization benchmarks improve monotonically with total parameters, mirroring training loss. By contrast, reasoning performance saturates and can even regress despite continued gains in both total parameters and training loss. Altering top-$k$ alone has little effect when active parameters are constant, and classic hyperparameters such as learning rate and initialization modulate the generalization gap in the same direction as sparsity. Neither post-training reinforcement learning (GRPO) nor extra test-time compute rescues the reasoning deficit of overly sparse models. Our model checkpoints, code and logs are open-source at https://github.com/rioyokotalab/optimal-sparsity.
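The sparsity knob the abstract describes is top-$k$ routing: each token's router scores all experts, but only the $k$ highest-scoring experts run, so activated parameters grow with $k$ while total parameters grow with the expert count. A minimal sketch of that selection step, in plain Python (the function name `topk_route` and the toy logits are illustrative, not from the paper's codebase):

```python
import math

def topk_route(router_logits, k=2):
    """Select the top-k experts for one token and softmax-normalize
    their scores into gate weights. Only these k experts execute, so
    activated parameters scale with k while total parameters scale
    with len(router_logits) -- the sparsity dimension the paper varies."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)
    selected = ranked[:k]
    # Numerically stable softmax over the selected experts only.
    m = max(router_logits[e] for e in selected)
    exps = [math.exp(router_logits[e] - m) for e in selected]
    z = sum(exps)
    return [(e, w / z) for e, w in zip(selected, exps)]

# Four experts, route to the top two; gates sum to 1.
gates = topk_route([0.1, 2.0, -1.0, 2.0], k=2)
```

Under the paper's fixed-compute framing, raising the expert count at constant $k$ grows total parameters without growing per-token compute, which is exactly the regime where the authors observe reasoning saturating while memorization keeps improving.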
Problem

Research questions and friction points this paper is trying to address.

How MoE sparsity affects memorization versus reasoning capabilities
Determining optimal sparsity configuration for reasoning performance
Investigating generalization gaps in sparse MoE language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

MoE sparsity optimization for reasoning tasks
Systematic variation of parameters under fixed compute
Analysis of generalization and accuracy gaps