Sparsity-Controllable Dynamic Top-p MoE for Large Foundation Model Pre-training

📅 2025-12-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing Top-k routing ignores token-level difficulty variation, while fixed Top-p routing suffers from poorly controlled computational cost and hyperparameter sensitivity. To address these issues, this paper proposes a dynamic Top-p routing mechanism for demand-driven, hierarchical, target-sparsity-configurable expert activation in Mixture-of-Experts (MoE) models. The key contributions are: (1) the first probabilistic threshold adaptation scheme based on a Proportional-Integral (PI) controller, enabling precise, fine-grained sparsity control across tokens, layers, and model scales; and (2) a layer-aware logits normalization method that enhances routing stability. Extensive experiments on large language models (LLMs) and Diffusion Transformers demonstrate that the method consistently outperforms Top-k and fixed Top-p baselines while strictly maintaining the target sparsity, and that it remains robust and scalable as the number of experts, model size, and training data volume grow.

📝 Abstract
Sparse Mixture-of-Experts (MoE) architectures effectively scale model capacity by activating only a subset of experts for each input token. However, the standard Top-k routing strategy imposes a uniform sparsity pattern that ignores the varying difficulty of tokens. While Top-p routing offers a flexible alternative, existing implementations typically rely on a fixed global probability threshold, which results in uncontrolled computational costs and sensitivity to hyperparameter selection. In this paper, we propose DTop-p MoE, a sparsity-controllable dynamic Top-p routing mechanism. To resolve the challenge of optimizing a non-differentiable threshold, we utilize a Proportional-Integral (PI) Controller that dynamically adjusts the probability threshold to align the running activated-expert sparsity with a specified target. Furthermore, we introduce a dynamic routing normalization mechanism that adapts layer-wise routing logits, allowing different layers to learn distinct expert-selection patterns while utilizing a global probability threshold. Extensive experiments on Large Language Models and Diffusion Transformers demonstrate that DTop-p consistently outperforms both Top-k and fixed-threshold Top-p baselines. Our analysis confirms that DTop-p maintains precise control over the number of activated experts while adaptively allocating resources across different tokens and layers. Furthermore, DTop-p exhibits strong scaling properties with respect to expert granularity, expert capacity, model size, and dataset size, offering a robust framework for large-scale MoE pre-training.
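The core control loop described in the abstract can be illustrated concretely. The sketch below is a minimal toy, not the paper's implementation: the PI gains, EMA smoothing factor, random stand-in router logits, and the cumulative-probability form of Top-p selection are all assumptions made for illustration. It shows how a PI controller can steer the probability threshold so the running activated-expert rate tracks a sparsity target.

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts = 16
target_rate = 2 / num_experts        # aim for ~2 active experts per token

def top_p_select(probs, p):
    """Pick the smallest set of experts (by descending routing probability)
    whose cumulative probability reaches the threshold p."""
    order = np.argsort(probs)[::-1]
    k = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    return order[:k]

# Hypothetical PI loop: gains and smoothing are illustrative, not from the paper.
p, integral, ema = 0.5, 0.0, target_rate
kp, ki = 0.1, 0.01
for _ in range(3000):
    logits = rng.normal(size=num_experts)          # stand-in router logits
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax routing scores
    # running (EMA) estimate of the activated-expert rate
    ema = 0.9 * ema + 0.1 * len(top_p_select(probs, p)) / num_experts
    err = target_rate - ema        # positive -> too sparse -> raise p
    integral += err
    p = float(np.clip(p + kp * err + ki * integral, 1e-3, 1.0 - 1e-3))
```

The integral term drives the steady-state tracking error to zero, which is what lets a single scalar threshold hold the average number of activated experts at the target without being differentiable or trained by gradient descent.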
Problem

Research questions and friction points this paper is trying to address.

Uniform Top-k routing ignores token-level difficulty, so expert activation cannot adapt per token
Fixed Top-p thresholds leave computational cost uncontrolled and sensitive to hyperparameter choice
The routing probability threshold is non-differentiable and cannot be optimized by gradient descent
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Top-p routing with PI controller for sparsity control
Dynamic routing normalization adapts layer-wise expert selection
Sparsity-controllable MoE framework for scalable foundation model pre-training
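The layer-wise normalization idea above can be sketched as follows. The paper's exact normalization is not reproduced here; this assumes a simple per-token standardization of routing logits, which removes scale differences between layers so that one global Top-p threshold is comparable everywhere.

```python
import numpy as np

def normalize_router_logits(logits, eps=1e-6):
    """Hypothetical layer-wise standardization (an assumption, not the
    paper's method): zero mean, unit variance over the expert dimension."""
    mu = logits.mean(axis=-1, keepdims=True)
    sd = logits.std(axis=-1, keepdims=True)
    return (logits - mu) / (sd + eps)

# Two layers whose raw logits differ only in scale would concentrate their
# softmax very differently; after normalization they route identically.
rng = np.random.default_rng(1)
base = rng.normal(size=(4, 16))            # 4 tokens, 16 experts
sharp_layer = base * 5.0                   # peaky routing distribution
flat_layer = base * 0.2                    # near-uniform routing distribution
norm_sharp = normalize_router_logits(sharp_layer)
norm_flat = normalize_router_logits(flat_layer)
```

The design point is that, without such normalization, a global threshold would activate many experts in flat-logit layers and very few in sharp-logit layers; standardizing first lets each layer still learn its own expert-selection pattern while sharing one threshold.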