Test-Time Scaling Makes Overtraining Compute-Optimal

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical limitation of existing scaling laws: they neglect the computational cost of inference-stage operations, such as repeated sampling, and thus cannot jointly optimize training and inference resource allocation under an end-to-end compute budget. The authors propose Train-to-Test (T²) scaling laws, the first framework to explicitly incorporate test-time scaling into compute-optimal analysis. T² jointly optimizes model size, training data volume, and inference sampling count under a fixed total compute budget, using a pass@$k$-based formulation that accounts for both task loss and task accuracy. Empirical validation across eight downstream tasks and post-training scenarios shows that once inference costs are included, the optimal strategy shifts significantly toward overtraining. Models guided by T² consistently outperform those designed with conventional scaling approaches, underscoring its relevance for deploying state-of-the-art large language models.
📝 Abstract
Modern LLMs scale at test-time, e.g. via repeated sampling, where inference cost grows with model size and the number of samples. This creates a trade-off that pretraining scaling laws, such as Chinchilla, do not address. We present Train-to-Test ($T^2$) scaling laws that jointly optimize model size, training tokens, and number of inference samples under fixed end-to-end budgets. $T^2$ modernizes pretraining scaling laws with the pass@$k$ modeling used for test-time scaling, then jointly optimizes pretraining and test-time decisions. Forecasts from $T^2$ are robust across distinct modeling approaches: measuring the joint scaling effect on the task loss and modeling the impact on task accuracy. Across eight downstream tasks, we find that when accounting for inference cost, optimal pretraining decisions shift radically into the overtraining regime, well outside the range of standard pretraining scaling suites. We validate our results by pretraining heavily overtrained models in the optimal region that $T^2$ scaling forecasts, confirming their substantially stronger performance compared to pretraining scaling alone. Finally, as frontier LLMs are post-trained, we show that our findings survive the post-training stage, making $T^2$ scaling meaningful in modern deployments.
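The abstract's two ingredients can be made concrete with standard formulas: the unbiased pass@$k$ estimator (from the Codex/HumanEval evaluation literature) and the usual FLOP approximations of roughly $6ND$ for training and $2N$ per generated token at inference. A minimal sketch, with the caveat that the paper's exact compute accounting and $T^2$ fitting procedure may differ:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n generations of which c
    are correct, solves the task. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: success guaranteed
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def end_to_end_flops(N: float, D: float, k: int,
                     queries: float, gen_tokens: float) -> float:
    """Illustrative end-to-end budget (an assumption, not the paper's
    exact accounting): ~6*N*D FLOPs for training a model with N
    parameters on D tokens, plus ~2*N FLOPs per generated token at
    inference, repeated k times for each of `queries` prompts."""
    return 6.0 * N * D + 2.0 * N * k * queries * gen_tokens
```

Under a fixed `end_to_end_flops` budget, raising the sample count `k` eats into the training term, so the compute-optimal `(N, D)` pair moves toward smaller models trained on more tokens per parameter, i.e. the overtraining regime the paper identifies.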
Problem

Research questions and friction points this paper is trying to address.

test-time scaling
overtraining
scaling laws
inference cost
compute-optimal
Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time scaling
overtraining
scaling laws
compute-optimal
pass@$k$