4+3 Phases of Compute-Optimal Neural Scaling Laws

📅 2024-05-23
🏛️ Neural Information Processing Systems
📈 Citations: 12
✨ Influential: 1
📄 PDF
🤖 AI Summary
This work addresses the problem of determining the optimal neural network parameter count under compute constraints and infinite data. We propose an analytically tractable neural scaling model that characterizes the relationship among three quantities: data complexity, target complexity, and model size. Training this model with one-pass SGD on a mean-squared loss, we derive closed-form expressions for the full training loss trajectory. We rigorously establish the existence of four primary and three secondary compute-optimal scaling regimes, whose phase boundaries are governed by the relative dominance of model capacity, optimizer-induced noise, and feature-embedding geometry. Combining theoretical analysis with large-scale numerical experiments, we precisely quantify the scaling exponents in each regime and provide explicit closed-form formulas for the optimal parameter count as a function of floating-point operation budget, thereby substantially improving predictions of large-model training efficiency.

๐Ÿ“ Abstract
We consider the solvable neural scaling model with three parameters: data complexity, target complexity, and model-parameter-count. We use this neural scaling model to derive new predictions about the compute-limited, infinite-data scaling law regime. To train the neural scaling model, we run one-pass stochastic gradient descent on a mean-squared loss. We derive a representation of the loss curves which holds over all iteration counts and improves in accuracy as the model parameter count grows. We then analyze the compute-optimal model-parameter-count, and identify 4 phases (+3 subphases) in the data-complexity/target-complexity phase-plane. The phase boundaries are determined by the relative importance of model capacity, optimizer noise, and embedding of the features. We furthermore derive, with mathematical proof and extensive numerical evidence, the scaling-law exponents in all of these phases, in particular computing the optimal model-parameter-count as a function of floating point operation budget.
Problem

Research questions and friction points this paper is trying to address.

Derives compute-limited scaling laws for neural models
Identifies 4+3 phases in data-target complexity plane
Computes optimal model size for given FLOP budget
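The last point above, choosing the model size that minimizes loss at a fixed FLOP budget, can be illustrated with a generic power-law loss surface. The constants and exponents here are placeholders, not the paper's derived values: assume L(N, T) = A·N^(-alpha) + B·T^(-beta) with compute budget C = N·T, which gives the optimum N*(C) ∝ C^(beta/(alpha+beta)).

```python
import numpy as np

# Hypothetical power-law loss surface (A, B, alpha, beta are illustrative
# placeholders, not the paper's derived exponents):
#   L(N, T) = A * N**-alpha + B * T**-beta,   compute budget C = N * T.
# Setting dL/dN = 0 with T = C / N gives N*(C) proportional to
# C**(beta / (alpha + beta)).
A, B, alpha, beta = 1.0, 1.0, 0.5, 0.5

def loss(N, C):
    """Loss at model size N under compute budget C (so T = C / N steps)."""
    T = C / N
    return A * N ** -alpha + B * T ** -beta

# Grid-search the compute-optimal model size for several budgets.
for C in (1e6, 1e8, 1e10):
    Ns = np.logspace(1, np.log10(C) - 1, 2000)
    N_opt = Ns[np.argmin(loss(Ns, C))]
    print(f"C={C:.0e}  N* ~ {N_opt:.3g}")
```

With alpha = beta the grid search recovers N* ≈ C^(1/2), i.e. parameters and steps grow in lockstep; the paper's contribution is to derive which exponent actually governs N*(C) in each of the 4+3 phases.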
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses one-pass SGD for neural scaling model training
Derives loss curves representation for all iterations
Identifies 4+3 phases in compute-optimal scaling