🤖 AI Summary
Standard language-modeling loss fails to accurately predict downstream task performance for pretrained language models in the overtrained regime. Method: This paper proposes a task-level two-step scaling framework: first predict a task-specific loss as a function of model and data scale, then map that loss to task accuracy. It introduces a lightweight "model ladder" strategy—small models that together require only 1% of the target models' compute—to fit both prediction steps. Contribution/Results: On four downstream multiple-choice tasks, the framework predicts the accuracy of a 7B and a 13B target model within 2 points of absolute error; four other tasks with higher-variance metrics show larger errors (6.9 points on average). Experiments show that spending less compute on fewer ladder models tends to degrade predictions, and that the two-step approach outperforms single-step power-law baselines. This work addresses a core limitation of conventional loss-based scaling laws, which cannot characterize task-specific behavior.
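As a quick sanity check on the 1% figure, the ladder budget can be estimated with the common C ≈ 6·N·D approximation for dense-transformer training FLOPs. The approximation and the sketch below are our assumptions; the paper may account for compute differently.

```python
# Back-of-envelope compute budget using the common C ~= 6*N*D estimate for
# dense-transformer training FLOPs (an assumption, not the paper's accounting).
# Target models from the abstract: 7B params at 4T tokens, 13B at 5T tokens.
target_flops = 6 * 7e9 * 4e12 + 6 * 13e9 * 5e12  # ~5.6e23 FLOPs in total
ladder_flops = 0.01 * target_flops               # the stated 1% ladder budget
print(f"targets ~ {target_flops:.2e} FLOPs, ladder ~ {ladder_flops:.2e} FLOPs")
```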
📝 Abstract
We develop task scaling laws and model ladders to predict the individual task performance of pretrained language models (LMs) in the overtrained setting. Standard power laws for language modeling loss cannot accurately model task performance. Therefore, we leverage a two-step prediction approach: first use model and data size to predict a task-specific loss, and then use this task loss to predict task performance. We train a set of small-scale "ladder" models, collect data points to fit the parameterized functions of the two prediction steps, and make predictions for two target models: a 7B model trained to 4T tokens and a 13B model trained to 5T tokens. Training the ladder models only costs 1% of the compute used for the target models. On four multiple-choice tasks written in ranked classification format, we can predict the accuracy of both target models within 2 points of absolute error. We have higher prediction error on four other tasks (average absolute error of 6.9 points) and find that these are often tasks with higher variance in task metrics. We also find that using less compute to train fewer ladder models tends to deteriorate predictions. Finally, we empirically show that our design choices and the two-step approach lead to superior performance in establishing scaling laws.
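To make the two-step procedure concrete, here is a minimal sketch that assumes step 1 follows a Chinchilla-style power law, L(N, D) = E + A/N^α + B/D^β, and step 2 a sigmoidal loss-to-accuracy map. The functional forms, ladder configurations, and synthetic measurements below are illustrative assumptions, not the paper's exact parameterization.

```python
# Minimal two-step scaling-law sketch, fit on synthetic "ladder" data points.
# Assumed forms (not necessarily the paper's): a Chinchilla-style power law
# for step 1 and a sigmoid for step 2.
import numpy as np
from scipy.optimize import curve_fit

def task_loss(nd, E, A, alpha, B, beta):
    """Step 1: task-specific loss from model size N and token count D."""
    N, D = nd
    return E + A / N**alpha + B / D**beta

def accuracy_from_loss(L, a, b, k, L0):
    """Step 2: sigmoidal map from task loss to task accuracy."""
    return a / (1.0 + np.exp(-k * (L - L0))) + b

# Hypothetical ladder models: a few small (N, D) training configurations.
N = np.array([190e6, 190e6, 370e6, 760e6, 1.3e9, 1.3e9])
D = np.array([3.8e9, 19e9, 7.4e9, 15.2e9, 26e9, 130e9])

# Synthetic measurements drawn from assumed "true" parameters plus noise,
# so the fits below are well-posed; real ladder runs would supply these.
rng = np.random.default_rng(0)
loss = task_loss((N, D), 0.9, 120.0, 0.25, 250.0, 0.28) + rng.normal(0, 0.005, N.size)
acc = accuracy_from_loss(loss, -0.75, 1.0, 4.0, 2.0) + rng.normal(0, 0.005, N.size)

# Fit each prediction step separately on the ladder data points.
p1, _ = curve_fit(task_loss, (N, D), loss,
                  p0=[1.0, 100.0, 0.3, 100.0, 0.3], maxfev=50000)
p2, _ = curve_fit(accuracy_from_loss, loss, acc,
                  p0=[-0.5, 1.0, 3.0, 2.0], maxfev=50000)

# Chain the two fitted steps to forecast a target model (7B params, 4T tokens).
L_7b = task_loss((7e9, 4e12), *p1)
print(f"predicted task loss: {L_7b:.3f}  "
      f"predicted accuracy: {accuracy_from_loss(L_7b, *p2):.3f}")
```

In the paper's setting, actual ladder-model runs would replace the synthetic measurements, and the same chained prediction would be applied to the 7B-at-4T and 13B-at-5T targets.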