Pretraining Scaling Laws for Generative Evaluations of Language Models

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior scaling laws for language models focus predominantly on discriminative evaluations (e.g., multiple-choice accuracy), leaving generative tasks such as mathematical reasoning and code generation poorly characterized. Method: This work proposes pretraining scaling laws for generative performance, fitting pass@k with three different sets of covariates: (1) compute, (2) model parameters and training tokens, and (3) log-likelihoods of gold reference solutions. The pass@k hyperparameter *k* acts as a knob controlling both the fitted scaling-law parameters and the predictability of performance. Results: The compute and parameters+tokens laws' parameters stabilize over the last ~1.5-2.5 orders of magnitude, while the gold reference likelihood law stabilizes over the last ~5; predictively, all three perform comparably, with the compute law slightly worse at small *k* and the reference likelihood law slightly worse at large *k*. The work also establishes a theoretical connection: the compute scaling law emerges as the compute-optimal envelope of the parameters-and-tokens law, providing a principled basis for forecasting generative capability and allocating resources.

📝 Abstract
Neural scaling laws have played a central role in modern machine learning, driving the field's ever-expanding scaling of parameters, data and compute. While much research has gone into fitting scaling laws and predicting performance on pretraining losses and on discriminative evaluations such as multiple-choice question-answering, comparatively little research has been done on fitting scaling laws and predicting performance on generative evaluations such as mathematical problem-solving or software engineering. We propose and evaluate three different pretraining scaling laws for fitting pass-at-$k$ on generative evaluations and for predicting pass-at-$k$ of the most expensive model using the performance of cheaper models. Our three scaling laws differ in the covariates used: (1) compute, (2) model parameters and tokens, (3) log likelihoods of gold reference solutions. We make four main contributions: (1) We show how generative evaluations offer new hyperparameters (in our setting, $k$) that researchers can use to control the scaling-law parameters and the predictability of performance. (2) In terms of scaling-law parameters, we find that the compute scaling law and parameters+tokens scaling law stabilize for the last ~$1.5{-}2.5$ orders of magnitude, whereas the gold reference likelihood scaling law stabilizes for the last ~$5$ orders of magnitude. (3) In terms of predictive performance, we find all three scaling laws perform comparably, although the compute scaling law predicts slightly worse for small $k$ and the gold reference likelihood scaling law predicts slightly worse for large $k$. (4) We establish a theoretical connection that the compute scaling law emerges as the compute-optimal envelope of the parameters-and-tokens scaling law. Our framework provides researchers and practitioners with insights and methodologies to forecast generative performance.
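The pass-at-$k$ quantity these laws fit can be estimated from $n$ sampled generations, $c$ of which are correct, using the standard unbiased estimator from the code-generation evaluation literature; the paper's exact evaluation pipeline is not detailed here, so this is an illustrative sketch rather than its specific implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: the probability that at least one of
    k samples drawn without replacement from n generations is correct,
    given that c of the n generations pass.

    pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so every size-k
        # subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples of which 3 are correct, `pass_at_k(10, 3, 1)` reduces to the per-sample success rate 0.3, while larger $k$ pushes the estimate toward 1.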
Problem

Research questions and friction points this paper is trying to address.

Developing scaling laws for generative evaluations of language models
Predicting performance on tasks like mathematical problem-solving and software engineering
Evaluating three scaling law approaches using compute, parameters+tokens, and gold reference likelihoods as covariates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposed three pretraining scaling laws for generative evaluations
Used compute, model parameters, and tokens as covariates
Applied gold reference likelihoods to predict model performance
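To illustrate how a covariate-based law of this kind might be fit, the sketch below regresses the failure rate $1 - \text{pass@}k$ against compute $C$ as a pure power law, $\log(1 - \text{pass@}k) = \log B - \alpha \log C$, via closed-form least squares. The functional form, and the names `fit_power_law` and `predict`, are assumptions for illustration; the paper's actual parameterizations are not reproduced in this summary:

```python
import math

def fit_power_law(computes, pass_rates):
    """Fit 1 - pass@k ~ B * C^(-alpha) by linear least squares in
    log-log space, returning (alpha, B)."""
    xs = [math.log(c) for c in computes]
    ys = [math.log(1.0 - p) for p in pass_rates]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return -slope, math.exp(intercept)  # alpha, B

def predict(compute, alpha, B):
    """Extrapolate pass@k for an unseen (larger) compute budget."""
    return 1.0 - B * compute ** (-alpha)
```

Fitting on cheap models and calling `predict` at a larger compute budget mirrors the paper's use case of forecasting the most expensive model's pass-at-$k$ from cheaper runs.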