🤖 AI Summary
This study addresses the trade-off between model performance and training cost in large language model (LLM) development, with the goal of maximizing profit. By integrating scaling laws with microeconomic theory, it establishes the first rational decision-making framework for LLM training, distinguishing between compute-constrained and data-constrained regimes and deriving optimal strategies for model scale and training budget allocation. The analysis reveals that under compute constraints, the optimal model size and token budget grow nearly linearly with hardware efficiency (FLOPs/$), while total training cost scales sub-quadratically; under data constraints, optimal expenditure scales quadratically with available data volume and inversely with hardware efficiency. The findings indicate that current industry practices only partially align with theoretical optima, offering quantitative guidance for more efficient LLM training.
📝 Abstract
Scaling LLMs requires tremendous computational resources, and recent advances in AI have gone hand in hand with massive capital expenditure. While it is established that scaling up LLMs reliably increases model quality (quantified in terms of loss or downstream evaluations), it is unclear how these quality improvements translate to potential revenue, and whether revenue increases would offset the costs of larger-scale training and inference. In this work, we develop an economic model for characterizing the rational behavior of an LLM training firm by combining scaling laws with microeconomic theory. Under our model of firm behavior, LLM quality can be increased with more parameters and training tokens, leading to more potential adoption by consumers, who each have a quality threshold for using the LLM. On the other hand, additional parameters and training tokens both incur additional costs. We analyze the profit maximization problem for this model under compute-bound and data-bound regimes. In the compute-bound regime, optimal model size and token budget track hardware efficiency $E$ (FLOPs/\$) at a near-linear rate; total training cost then scales sub-quadratically in $E$. Data efficiency improvements incentivize larger models and training expenditure. When we are limited to $D$ data, profit-optimal training expenditure scales as $D^2/E$, i.e., it increases with data and decreases with hardware efficiency (as well as data efficiency). Finally, we analyze practical trends in training expenditure: current trends are consistent with our most permissive model variants in the compute-bound regime, but are not profit-optimal in the data-bound regime or under the assumption that hardware advances will stall. Overall, our results provide a theory of profit-optimal LLM training, offering a foundation for engaging critically with industry statements and supporting long-term economic decision making.
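As an informal illustration of the data-bound $D^2/E$ scaling, the sketch below computes a training bill under two assumptions that are not taken from the abstract: the common rule of thumb that training takes roughly $6ND$ FLOPs for $N$ parameters and $D$ tokens, and a model size chosen proportional to the available data. The function names and the `params_per_token` value are hypothetical, and this is not the paper's profit model, only a way to see how such a scaling can arise.

```python
def training_cost_usd(n_params: float, n_tokens: float, flops_per_dollar: float) -> float:
    """Dollar cost of one training run, assuming ~6 * N * D training FLOPs."""
    train_flops = 6.0 * n_params * n_tokens  # rule-of-thumb approximation (assumption)
    return train_flops / flops_per_dollar    # E converts FLOPs into dollars


def data_bound_cost_usd(n_tokens: float, flops_per_dollar: float,
                        params_per_token: float = 0.05) -> float:
    """Cost when model size is tied to the data budget: N = params_per_token * D.

    params_per_token is an illustrative constant (assumption), not a value
    from the paper.
    """
    n_params = params_per_token * n_tokens
    return training_cost_usd(n_params, n_tokens, flops_per_dollar)


# Compare a baseline against twice the data and twice the hardware efficiency.
base = data_bound_cost_usd(n_tokens=1e12, flops_per_dollar=1e17)
more_data = data_bound_cost_usd(n_tokens=2e12, flops_per_dollar=1e17)
better_hw = data_bound_cost_usd(n_tokens=1e12, flops_per_dollar=2e17)

print(f"cost ratio with 2x data:     {more_data / base:.2f}")
print(f"cost ratio with 2x hardware: {better_hw / base:.2f}")
```

Under these assumptions, doubling $D$ multiplies the cost by four and doubling $E$ halves it, which matches the direction of the $D^2/E$ dependence stated in the abstract.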