Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training

📅 2025-12-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work challenges the prevailing view in LLM scaling research that downstream performance is unreliable to predict. Methodologically, it introduces a power-law framework that directly models the logarithm of downstream task accuracy as a function of the training budget (tokens and parameters), and empirically shows that standard downstream metrics (e.g., MMLU, ARC) obey clean power-law scaling. The proposed functional form is unified: it generalizes across varying token-to-parameter ratios and accounts for inference compute, sidestepping the error propagation inherent in the conventional two-stage procedure (pretraining loss → downstream performance). Validated on models with up to 17B parameters trained on up to 350B tokens, the direct approach achieves accurate extrapolation across diverse data mixtures. All pretraining loss curves and downstream evaluation results are publicly released, enabling reproducible scaling-law research.

📝 Abstract
While scaling laws for Large Language Models (LLMs) traditionally focus on proxy metrics like pretraining loss, predicting downstream task performance has been considered unreliable. This paper challenges that view by proposing a direct framework to model the scaling of benchmark performance from the training budget. We find that for a fixed token-to-parameter ratio, a simple power law can accurately describe the scaling behavior of log accuracy on multiple popular downstream tasks. Our results show that the direct approach extrapolates better than the previously proposed two-stage procedure, which is prone to compounding errors. Furthermore, we introduce functional forms that predict accuracy across token-to-parameter ratios and account for inference compute under repeated sampling. We validate our findings on models with up to 17B parameters trained on up to 350B tokens across two dataset mixtures. To support reproducibility and encourage future research, we release the complete set of pretraining losses and downstream evaluation results.
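The abstract's central claim is that, at a fixed token-to-parameter ratio, log accuracy scales as a simple power law in the training budget. A minimal sketch of what fitting such a law might look like, assuming the hypothetical form -log(acc) = A·C^(-α) in compute C (the paper's exact parameterization may differ), is linear in log-log space and can be fit with ordinary least squares:

```python
import numpy as np

def fit_power_law(compute, accuracy):
    """Fit the assumed form -log(acc) = A * C^(-alpha).

    Taking logs twice gives log(-log(acc)) = log(A) - alpha * log(C),
    which is a straight line fit in (log C, log(-log acc)) space.
    """
    x = np.log(compute)
    y = np.log(-np.log(accuracy))
    slope, intercept = np.polyfit(x, y, 1)
    return np.exp(intercept), -slope  # A, alpha

# Synthetic, noise-free trajectory with known A=5.0, alpha=0.3,
# standing in for a real (compute, benchmark accuracy) curve.
C = np.logspace(18, 22, 20)          # training compute in FLOPs
acc = np.exp(-5.0 * C ** -0.3)
A, alpha = fit_power_law(C, acc)
print(round(A, 2), round(alpha, 2))  # recovers 5.0 and 0.3
```

Fitting in log-log space keeps the regression linear; on real, noisy benchmark trajectories a nonlinear fit (e.g. `scipy.optimize.curve_fit`) on the original scale would weight points differently.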
Problem

Research questions and friction points this paper is trying to address.

Downstream benchmark performance has been considered unreliable to predict from scaling laws
The conventional two-stage procedure (pretraining loss → downstream performance) compounds extrapolation errors
Existing functional forms do not account for varying token-to-parameter ratios or inference compute under repeated sampling
Innovation

Methods, ideas, or system contributions that make the work stand out.

A direct power law models the scaling of log accuracy on downstream benchmarks
Unified functional forms predict accuracy across token-to-parameter ratios and inference compute
Validated on models with up to 17B parameters trained on up to 350B tokens across two data mixtures
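The abstract also mentions accounting for inference compute under repeated sampling. A hedged sketch of the standard way this is modeled (the paper's exact functional form is not given here): if a single attempt succeeds with probability p, then k independent samples succeed at least once with probability 1 - (1 - p)^k, so accuracy improves predictably as inference compute grows:

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples succeeds,
    assuming each attempt succeeds independently with probability p."""
    return 1.0 - (1.0 - p) ** k

# A weak single-attempt model improves sharply with repeated sampling.
print(round(pass_at_k(0.2, 1), 3))   # 0.2
print(round(pass_at_k(0.2, 10), 3))  # 0.893
```

Under this independence assumption, trading training compute for inference compute shifts the effective accuracy curve, which is what a unified scaling form across both budgets has to capture.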