Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws

📅 2023-12-31
🏛️ International Conference on Machine Learning
📈 Citations: 60
Influential: 2
🤖 AI Summary
Classical LLM scaling laws (e.g., Chinchilla) neglect inference cost, leading to suboptimal joint training-deployment decisions. Method: The authors introduce the first scaling law that explicitly incorporates inference request volume, formulating an objective that minimizes combined training and inference cost subject to a model-quality target; they derive optimal model size and pretraining dataset size via theoretical modeling, 47 controlled-scale training runs, and ablation studies on coefficient fitting. Results: Performance continues improving at extreme token-to-parameter ratios (up to 10⁴), beyond the regime Chinchilla was fit on. Smaller models trained longer outperform Chinchilla-optimal configurations when inference demand is reasonably large (e.g., ~1B requests). Predictions align closely with measurements, and Chinchilla's overestimate of the value of additional tokens at extreme ratios is attributed to fitting its coefficients only on data from typical token/parameter ratios.
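The joint objective described above can be sketched numerically: fix a target loss under a Chinchilla-style parametric loss, and for each candidate model size solve for the training tokens needed, then pick the size minimizing training FLOPs (6ND) plus inference FLOPs (2N per served token). The loss coefficients below are the commonly cited Hoffmann et al. estimates, used here as illustrative assumptions rather than the paper's exact fitted values.

```python
import numpy as np

# Chinchilla-style parametric loss L(N, D) = E + A/N^alpha + B/D^beta.
# Coefficient values are the widely quoted Hoffmann et al. (2022) estimates,
# assumed here for illustration.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n, d):
    """Predicted pretraining loss for n parameters and d training tokens."""
    return E + A / n**alpha + B / d**beta

def optimal_config(target_loss, inference_tokens):
    """Cheapest (N, D) reaching target_loss, counting 6*N*D training FLOPs
    plus 2*N FLOPs per token served at inference time."""
    best = None
    for n in np.logspace(9.3, 12, 2000):  # ~2B to 1T parameters
        slack = target_loss - E - A / n**alpha
        if slack <= 0:
            continue  # this size can never reach target_loss, even as D -> inf
        d = (B / slack) ** (1 / beta)  # tokens needed at this size
        total_flops = 6 * n * d + 2 * n * inference_tokens
        if best is None or total_flops < best[0]:
            best = (total_flops, n, d)
    return best

# With zero inference demand this recovers a Chinchilla-like optimum; adding
# demand (here an assumed 2e12 lifetime inference tokens) shifts the optimum
# toward a smaller model trained on more tokens.
_, n_train_only, d_train_only = optimal_config(2.0, inference_tokens=0)
_, n_deploy, d_deploy = optimal_config(2.0, inference_tokens=2e12)
```

Both configurations hit the same target loss; the deployment-aware one trades a smaller N for a larger D, which is exactly the "smaller models trained longer" regime the summary describes.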

📝 Abstract
Large language model (LLM) scaling laws are empirical formulas that estimate changes in model quality as a result of increasing parameter count and training data. However, these formulas, including the popular DeepMind Chinchilla scaling laws, neglect to include the cost of inference. We modify the Chinchilla scaling laws to calculate the optimal LLM parameter count and pre-training data size to train and deploy a model of a given quality and inference demand. We conduct our analysis both in terms of a compute budget and real-world costs and find that LLM researchers expecting reasonably large inference demand (~1B requests) should train models smaller and longer than Chinchilla-optimal. Furthermore, we train 47 models of varying sizes and parameter counts to validate our formula and find that model quality continues to improve as we scale tokens per parameter to extreme ranges (up to 10,000). Finally, we ablate the procedure used to fit the Chinchilla scaling law coefficients and find that developing scaling laws only from data collected at typical token/parameter ratios overestimates the impact of additional tokens at these extreme ranges.
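The abstract's "real-world costs" analysis can be sketched by converting FLOPs into dollars: training and inference typically run at different hardware utilization, so the same FLOP is not equally priced in each phase. All constants below (utilization rates, peak throughput, hourly prices) are illustrative assumptions, not the paper's figures.

```python
# Convert FLOP counts into dollar costs under assumed hardware parameters.
TRAIN_MFU, INF_MFU = 0.40, 0.20   # assumed model FLOPs utilization per phase
PEAK_FLOPS = 312e12               # e.g. A100 BF16 peak, FLOP/s (assumption)
TRAIN_PRICE, INF_PRICE = 2.0, 2.0 # assumed $/accelerator-hour

def dollar_cost(n_params, train_tokens, inference_tokens):
    """Total deployment-lifetime cost: 6*N*D training FLOPs plus 2*N FLOPs
    per inference token, each divided by achieved throughput and priced."""
    train_seconds = 6 * n_params * train_tokens / (PEAK_FLOPS * TRAIN_MFU)
    inf_seconds = 2 * n_params * inference_tokens / (PEAK_FLOPS * INF_MFU)
    return (train_seconds / 3600) * TRAIN_PRICE + (inf_seconds / 3600) * INF_PRICE
```

Because inference usually achieves lower utilization than training, each inference FLOP costs more in real terms, which pushes the cost-optimal model even smaller than the pure compute-budget analysis suggests.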
Problem

Research questions and friction points this paper is trying to address.

Incorporates inference cost into LLM scaling laws
Determines optimal model size and data for deployment
Validates scaling with extreme token-parameter ratios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modify Chinchilla laws to include inference costs
Train smaller models longer under high inference demand
Validate scaling at extreme tokens-per-parameter ratios