Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws

📅 2023-12-31
🏛️ International Conference on Machine Learning
📈 Citations: 60
Influential: 2
🤖 AI Summary
Classical LLM scaling laws (e.g., Chinchilla) neglect inference cost, leading to suboptimal joint training-deployment decisions. Method: The authors introduce the first scaling law that explicitly incorporates inference request volume, formulating an objective that minimizes combined training and inference cost subject to a model-quality target; they derive optimal model size and pretraining dataset size via theoretical modeling, 47 controlled-scale training runs, and ablation studies on coefficient fitting. Results: Performance continues improving at extreme token-to-parameter ratios (up to 10⁴), beyond the regime Chinchilla was fit on. Smaller models trained longer outperform Chinchilla-optimal configurations when inference demand is reasonably large (e.g., ~1B requests). Predictions align closely with measurements, and Chinchilla's overestimate of the value of additional tokens at extreme ratios is attributed to fitting its coefficients only on data from typical token/parameter ratios.
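The joint objective described above can be sketched numerically: fix a target loss under a Chinchilla-style parametric loss, and for each candidate model size solve for the training tokens needed, then pick the size minimizing training FLOPs (6ND) plus inference FLOPs (2N per served token). The loss coefficients below are the commonly cited Hoffmann et al. estimates, used here as illustrative assumptions rather than the paper's exact fitted values.

```python
import numpy as np

# Chinchilla-style parametric loss L(N, D) = E + A/N^alpha + B/D^beta.
# Coefficient values are the widely quoted Hoffmann et al. (2022) estimates,
# assumed here for illustration.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n, d):
    """Predicted pretraining loss for n parameters and d training tokens."""
    return E + A / n**alpha + B / d**beta

def optimal_config(target_loss, inference_tokens):
    """Cheapest (N, D) reaching target_loss, counting 6*N*D training FLOPs
    plus 2*N FLOPs per token served at inference time."""
    best = None
    for n in np.logspace(9.3, 12, 2000):  # ~2B to 1T parameters
        slack = target_loss - E - A / n**alpha
        if slack <= 0:
            continue  # this size can never reach target_loss, even as D -> inf
        d = (B / slack) ** (1 / beta)  # tokens needed at this size
        total_flops = 6 * n * d + 2 * n * inference_tokens
        if best is None or total_flops < best[0]:
            best = (total_flops, n, d)
    return best

# With zero inference demand this recovers a Chinchilla-like optimum; adding
# demand (here an assumed 2e12 lifetime inference tokens) shifts the optimum
# toward a smaller model trained on more tokens.
_, n_train_only, d_train_only = optimal_config(2.0, inference_tokens=0)
_, n_deploy, d_deploy = optimal_config(2.0, inference_tokens=2e12)
```

Both configurations hit the same target loss; the deployment-aware one trades a smaller N for a larger D, which is exactly the "smaller models trained longer" regime the summary describes.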

📝 Abstract
Large language model (LLM) scaling laws are empirical formulas that estimate changes in model quality as a result of increasing parameter count and training data. However, these formulas, including the popular DeepMind Chinchilla scaling laws, neglect to include the cost of inference. We modify the Chinchilla scaling laws to calculate the optimal LLM parameter count and pre-training data size to train and deploy a model of a given quality and inference demand. We conduct our analysis both in terms of a compute budget and real-world costs and find that LLM researchers expecting reasonably large inference demand (~1B requests) should train models smaller and longer than Chinchilla-optimal. Furthermore, we train 47 models of varying sizes and parameter counts to validate our formula and find that model quality continues to improve as we scale tokens per parameter to extreme ranges (up to 10,000). Finally, we ablate the procedure used to fit the Chinchilla scaling law coefficients and find that developing scaling laws only from data collected at typical token/parameter ratios overestimates the impact of additional tokens at these extreme ranges.
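The abstract's "real-world costs" analysis can be sketched by converting FLOPs into dollars: training and inference typically run at different hardware utilization, so the same FLOP is not equally priced in each phase. All constants below (utilization rates, peak throughput, hourly prices) are illustrative assumptions, not the paper's figures.

```python
# Convert FLOP counts into dollar costs under assumed hardware parameters.
TRAIN_MFU, INF_MFU = 0.40, 0.20   # assumed model FLOPs utilization per phase
PEAK_FLOPS = 312e12               # e.g. A100 BF16 peak, FLOP/s (assumption)
TRAIN_PRICE, INF_PRICE = 2.0, 2.0 # assumed $/accelerator-hour

def dollar_cost(n_params, train_tokens, inference_tokens):
    """Total deployment-lifetime cost: 6*N*D training FLOPs plus 2*N FLOPs
    per inference token, each divided by achieved throughput and priced."""
    train_seconds = 6 * n_params * train_tokens / (PEAK_FLOPS * TRAIN_MFU)
    inf_seconds = 2 * n_params * inference_tokens / (PEAK_FLOPS * INF_MFU)
    return (train_seconds / 3600) * TRAIN_PRICE + (inf_seconds / 3600) * INF_PRICE
```

Because inference usually achieves lower utilization than training, each inference FLOP costs more in real terms, which pushes the cost-optimal model even smaller than the pure compute-budget analysis suggests.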
Problem

Research questions and friction points this paper is trying to address.

Incorporates inference cost into LLM scaling laws
Determines optimal model size and data for deployment
Validates scaling with extreme token-parameter ratios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modify Chinchilla laws to include inference costs
Train smaller models longer under high inference demand
Validate scaling at extreme tokens-per-parameter ratios