Prescriptive Scaling Laws for Data Constrained Training

📅 2026-05-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses how to allocate compute when high-quality training data is limited. The authors propose a scaling law that adds an overfitting penalty for data repetition, governed by a single coefficient, to the standard formulation, addressing the Chinchilla scaling law's assumption that every training token is unique. The fitted law indicates that beyond a certain point further repetition is counterproductive and compute is better spent on model capacity. Following the law's recommended configuration improves performance in data-constrained regimes, and strong weight decay (λ=1.0) reduces the overfitting coefficient by approximately 70%.

📝 Abstract

Training compute is increasingly outpacing the availability of high-quality data. This shifts the central challenge from optimal compute allocation to extracting maximum value from limited data. The widely adopted Chinchilla scaling law assumes every training token is unique. This limits its ability to guide pretraining decisions in data-constrained regimes. We model the excess loss under repetition with a simple additive overfitting penalty and find that it accurately describes model behavior. Our scaling law yields qualitatively new compute-optimal allocation advice. Beyond a point, further repetition is counterproductive and compute is better spent on model capacity. We show that following our law's recommended configuration improves performance in data-constrained regimes. Finally, because our one-parameter form isolates overfitting in a single coefficient, it enables direct comparison across training configurations. As a case study, we show that strong weight decay ($λ=1.0$) reduces this coefficient by approximately 70%, providing a scaling-law explanation for recent findings that optimal weight decay in data-constrained regimes is an order of magnitude larger than standard practice.
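
To make the qualitative argument concrete, here is a minimal sketch of how an additive overfitting penalty can produce an optimal repetition count. The base term is the Chinchilla parametric loss with the constants published by Hoffmann et al. (2022). Everything else is an assumption made for illustration: the penalty form `gamma * (epochs - 1)`, the value of `gamma`, and the function names are hypothetical placeholders, since the abstract does not state the paper's actual functional form or fitted values.

```python
# Chinchilla parametric fit, L(N, D) = E + A / N**alpha + B / D**beta,
# with the constants reported by Hoffmann et al. (2022).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def chinchilla_loss(n_params, n_tokens):
    """Chinchilla loss; assumes every one of n_tokens is unique."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def penalized_loss(n_params, unique_tokens, epochs, gamma):
    """Chinchilla loss on total tokens seen, plus an additive penalty.

    ASSUMPTION: the penalty gamma * (epochs - 1) is a placeholder chosen
    for illustration. It is NOT the functional form from the paper.
    """
    total_tokens = unique_tokens * epochs
    return chinchilla_loss(n_params, total_tokens) + gamma * (epochs - 1)


def best_epochs(n_params, unique_tokens, gamma, max_epochs=64):
    """Epoch count in 1..max_epochs that minimizes the penalized loss."""
    return min(
        range(1, max_epochs + 1),
        key=lambda e: penalized_loss(n_params, unique_tokens, e, gamma),
    )


if __name__ == "__main__":
    n_params, unique_tokens = 1e9, 1e9
    gamma = 0.01  # hypothetical overfitting coefficient, not a fitted value
    # The source reports that strong weight decay cuts the coefficient by ~70%.
    gamma_strong_wd = 0.3 * gamma
    print("no penalty:      ", best_epochs(n_params, unique_tokens, 0.0))
    print("baseline:        ", best_epochs(n_params, unique_tokens, gamma))
    print("strong wd (-70%):", best_epochs(n_params, unique_tokens, gamma_strong_wd))
```

Without a penalty, the sketch always prefers the maximum epoch count, which is the behavior implied by treating repeated tokens as unique. With a positive coefficient the best epoch count becomes finite, and a smaller coefficient moves it later, which is one way to read the weight decay case study under these assumptions.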

Problem

Research questions and friction points this paper is trying to address.

data-constrained training

scaling laws

overfitting

compute allocation

pretraining

Innovation

Methods, ideas, or system contributions that make the work stand out.

scaling laws

data-constrained training

overfitting penalty