The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws

πŸ“… 2025-01-21
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
The prohibitively high computational cost of large language model (LLM) pre-training hinders scalability and accessibility. Method: This paper systematically investigates dynamic parameter pruning for sparse pre-training, proposing a progressive pruning schedule that begins sparsification at 25% of total training compute and concludes at 75%. Contributions/Results: First, it identifies the critical impact of pruning timing on final model performance, a previously uncharacterized factor. Second, it introduces a novel scaling law parameterized by the *average parameter count* over pre-training, unifying performance modeling for sparse and dense pre-training. Theoretical analysis and empirical evaluation within the Chinchilla framework demonstrate that, at identical compute budgets, the method matches the evaluation loss of dense training while reducing final model size by 2–4× and inference FLOPs by more than 50%, preserving strong generalization.
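The summary's two key quantities, a pruning schedule that runs from 25% to 75% of training, and the average parameter count over that run, can be sketched as follows. This is an illustrative reconstruction only: the paper specifies the 25%/75% start and end points, but the linear ramp shape and the function names here are assumptions.

```python
def sparsity_at(step, total_steps, final_sparsity, start=0.25, end=0.75):
    """Fraction of parameters pruned at a given training step.

    Zero sparsity before `start` of training, a ramp (assumed linear
    here) between `start` and `end`, and the final sparsity afterward.
    """
    frac = step / total_steps
    if frac <= start:
        return 0.0
    if frac >= end:
        return final_sparsity
    return final_sparsity * (frac - start) / (end - start)


def average_param_count(n_dense, total_steps, final_sparsity):
    """Average parameter count over pre-training: the quantity that
    parameterizes the unified sparse/dense scaling law."""
    counts = [
        n_dense * (1.0 - sparsity_at(t, total_steps, final_sparsity))
        for t in range(total_steps)
    ]
    return sum(counts) / total_steps
```

With this schedule, a model pruned to 50% final sparsity spends most of training above half size, so its average parameter count (about 75% of the dense count here) sits well above its final count, which is why the final model can be 2–4× smaller while the loss tracks a dense run of the same average size.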

πŸ“ Abstract
Pruning eliminates unnecessary parameters in neural networks; it offers a promising solution to the growing computational demands of large language models (LLMs). While many focus on post-training pruning, sparse pre-training, which combines pruning and pre-training into a single phase, provides a simpler alternative. In this work, we present the first systematic exploration of optimal sparse pre-training configurations for LLMs through an examination of 80 unique pruning schedules across different sparsity levels and training durations. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. These findings provide valuable insights for efficient and effective sparse pre-training of LLMs. Furthermore, we propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training. Through empirical and theoretical validation, we demonstrate that this modified scaling law accurately models evaluation loss for both sparsely and densely pre-trained LLMs, unifying scaling laws across pre-training paradigms. Our findings indicate that while sparse pre-training achieves the same final model quality as dense pre-training for equivalent compute budgets, it provides substantial benefits through reduced model size, enabling significant potential computational savings during inference.
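The modified scaling law in the abstract keeps the Chinchilla functional form but feeds it the average parameter count instead of the final (dense) count. A minimal sketch, using the original Chinchilla coefficient fits from Hoffmann et al. purely for illustration (the paper's own fitted coefficients are not given here and may differ):

```python
def chinchilla_loss(n_avg, d, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Predicted evaluation loss under the Chinchilla form
    L = E + A / N^alpha + B / D^beta, with N taken to be the
    average parameter count n_avg over pre-training and d the
    number of training tokens. Coefficients are the original
    Chinchilla fits, shown only as placeholders.
    """
    return E + A / n_avg ** alpha + B / d ** beta
```

The unification claim then reads directly off this form: a sparse run and a dense run with the same average parameter count and token budget receive the same predicted loss, even though the sparse run ends with far fewer parameters.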
Problem

Research questions and friction points this paper is trying to address.

Large Language Model
Parameter Pruning
Efficient Pre-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pruning Strategy
Sparse Pre-training
Improved Weasel Rule
πŸ”Ž Similar Papers
No similar papers found.