Prescriptive Scaling Reveals the Evolution of Language Model Capabilities

📅 2026-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses how to reliably predict the upper-bound downstream performance of language models from their pretraining compute budget, and how stable that mapping remains as the field evolves. Leveraging large-scale empirical evaluations, the work uses smoothed quantile regression with a monotone, saturating S-shaped (sigmoid) parameterization to model high-percentile downstream performance as a function of pretraining FLOPs, establishing a stable and predictable performance frontier. Key contributions include uncovering a continually advancing frontier for mathematical reasoning, an efficient algorithm that approximates the full performance frontier using only about 20% of the evaluation budget, and the release of the Proteus-2k benchmark dataset. Empirical validation demonstrates strong temporal stability of the capability frontiers across most tasks, with the exception of mathematical reasoning, substantially reducing evaluation costs.

📝 Abstract
For deploying foundation models, practitioners increasingly need prescriptive scaling laws: given a pretraining compute budget, what downstream accuracy is attainable with contemporary post-training practice, and how stable is that mapping as the field evolves? Using large-scale observational evaluations, with 5k existing and 2k newly sampled observations of model performance, we estimate capability boundaries, i.e., high conditional quantiles of benchmark scores as a function of log pretraining FLOPs, via smoothed quantile regression with a monotone, saturating sigmoid parameterization. We validate temporal reliability by fitting on earlier model generations and evaluating on later releases. Across various tasks, the estimated boundaries are mostly stable, with the exception of math reasoning, which exhibits a consistently advancing boundary over time. We then extend our approach to analyze task-dependent saturation and to probe contamination-related shifts on math reasoning tasks. Finally, we introduce an efficient algorithm that recovers near-full-data frontiers using roughly 20% of the evaluation budget. Together, our work releases Proteus-2k, the latest model performance evaluation dataset, and introduces a practical methodology for translating compute budgets into reliable performance expectations and for monitoring when capability boundaries shift over time.
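The frontier-fitting idea in the abstract can be sketched in a few lines: target a high conditional quantile (say the 95th) of benchmark scores given log pretraining FLOPs by minimizing the pinball (quantile) loss over a monotone, saturating sigmoid curve. This is a minimal illustrative sketch, not the paper's released code; all function and parameter names, the choice of optimizer, and the synthetic data are assumptions.

```python
# Hypothetical sketch of a sigmoid-parameterized quantile frontier fit.
# Minimizing the pinball loss at tau targets the tau-th conditional quantile,
# which here plays the role of a "capability boundary".
import numpy as np
from scipy.optimize import minimize


def sigmoid_frontier(log_flops, params):
    """Monotone, saturating S-curve: score rises with log pretraining FLOPs."""
    lower, upper, slope, midpoint = params
    return lower + (upper - lower) / (1.0 + np.exp(-slope * (log_flops - midpoint)))


def pinball_loss(params, log_flops, scores, tau=0.95):
    """Quantile (pinball) loss, averaged over all model evaluations."""
    resid = scores - sigmoid_frontier(log_flops, params)
    return np.mean(np.maximum(tau * resid, (tau - 1.0) * resid))


def fit_frontier(log_flops, scores, tau=0.95):
    """Fit the four sigmoid parameters by minimizing the pinball loss."""
    x0 = np.array([scores.min(), scores.max(), 1.0, np.median(log_flops)])
    res = minimize(pinball_loss, x0, args=(log_flops, scores, tau),
                   method="Nelder-Mead")
    return res.x


# Synthetic demo: noisy S-shaped accuracy vs. log10 pretraining FLOPs,
# where most models sit below the frontier (one-sided noise).
rng = np.random.default_rng(0)
x = rng.uniform(20.0, 26.0, 500)
true = 0.1 + 0.8 / (1.0 + np.exp(-2.0 * (x - 23.0)))
y = np.clip(true - rng.exponential(0.1, size=x.shape), 0.0, 1.0)
params = fit_frontier(x, y)
```

Fitting the frontier on earlier releases and checking later models against it is then a matter of comparing new scores to `sigmoid_frontier` evaluated at their compute budgets, mirroring the temporal-reliability validation described above.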
Problem

Research questions and friction points this paper is trying to address.

prescriptive scaling
foundation models
compute budget
capability boundaries
temporal reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

prescriptive scaling
quantile regression
capability boundaries
compute-performance mapping
frontier estimation