Generalizing Scaling Laws for Dense and Sparse Large Language Models

📅 2025-08-08
🤖 AI Summary
Existing scaling laws for large language models (LLMs) lack a unified framework applicable to both dense and sparse architectures, limiting generalizability across model families. Method: We propose the first generalized scaling law framework that jointly models parameter count, dataset size, compute budget, and sparsity metrics—including number of experts and routing ratio—via a learnable, architecture-agnostic functional form. Building upon Chinchilla-style scaling theory, we incorporate sparsity-aware variables and dynamic training constraints to ensure consistency across diverse configurations. Results: Evaluated on dense models (LLaMA, Pythia) and sparse models (Mixtral, DeepSpeed-MoE), our framework reduces validation loss prediction error by 32% on average. It significantly improves accuracy in training resource allocation and enables reliable pre-training performance forecasting across heterogeneous architectures, thereby enhancing scalability planning for next-generation LLMs.
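The paper's exact fitted functional form is not reproduced here, but the idea of a sparsity-aware, Chinchilla-style loss predictor can be sketched as follows. The constants below are the published Chinchilla dense-model fit, and the `routing_ratio` adjustment (fraction of parameters active per token) is an illustrative assumption, not the paper's learned form:

```python
def generalized_loss(n_params, n_tokens, routing_ratio=1.0,
                     E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-style loss with a simple sparsity adjustment.

    E, A, B, alpha, beta are the Chinchilla (Hoffmann et al.) fit for
    dense models; the routing_ratio term is a hypothetical way to fold
    MoE sparsity into one architecture-agnostic formula.
    """
    # A sparse (MoE) model activates only a fraction of its total
    # weights per token, so we shrink the effective parameter count.
    n_eff = n_params * routing_ratio

    # Irreducible loss + parameter-limited term + data-limited term.
    return E + A / n_eff**alpha + B / n_tokens**beta
```

With `routing_ratio=1.0` this reduces to the dense Chinchilla law, so one formula covers both regimes; a sparse model with the same total parameter count predicts a higher loss, reflecting its smaller active capacity.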

📝 Abstract
Over the past few years, the size of language models has grown exponentially, as has the computational cost of training them. This rapid growth has motivated researchers to develop new techniques aimed at making training more efficient. Despite these advances, predicting the optimal model size and allocating training resources remain challenging. Several efforts have addressed this challenge by proposing different scaling laws, but almost all of them are architecture-specific (dense or sparse). In this work, we revisit existing scaling laws and propose a generalized scaling law that provides a unified framework applicable to both dense and sparse large language models. We evaluate and compare our proposed scaling law against existing scaling laws to demonstrate its effectiveness.
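The resource-allocation problem the abstract refers to can be made concrete with the standard Chinchilla-style split: under the common approximation that training compute is C ≈ 6·N·D (N parameters, D tokens), the compute-optimal allocation scales N and D roughly equally, N* ∝ C^0.5 and D* ∝ C^0.5. A minimal sketch, assuming those approximate exponents (the paper's generalized law would adjust this per architecture):

```python
def compute_optimal_split(compute_flops):
    """Split a FLOP budget between model size N and tokens D.

    Assumes C ~= 6 * N * D and the roughly equal Chinchilla scaling
    N* ~ C^0.5, D* ~ C^0.5; exponents here are illustrative, not the
    paper's fitted values.
    """
    n_opt = (compute_flops / 6) ** 0.5   # optimal parameter count
    d_opt = compute_flops / (6 * n_opt)  # tokens consuming the rest
    return n_opt, d_opt
```

For example, the budget used to train a 70B-parameter model on 1.4T tokens would, under this rule, be split into equal N* and D* of about 3.1e11 each, which is the sense in which Chinchilla-style laws prescribe "scale data with model size."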
Problem

Research questions and friction points this paper is trying to address.

Generalizing scaling laws for dense and sparse LLMs
Optimizing model size and resource allocation
Unified framework for diverse architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generalized scaling law for diverse models
Unified framework for dense and sparse LLMs
Comparative evaluation with existing scaling laws