AutoScale: Scale-Aware Data Mixing for Pre-Training LLMs

📅 2024-07-29
🤖 AI Summary
To address the failure of fixed data mixing ratios in large language model (LLM) pre-training as model scale increases, this paper proposes a cross-scale data-mixture optimization framework. First, a parametric loss-prediction model is trained under small-scale compute budgets to estimate domain-wise importance. Building on a theoretical analysis of how compositions evolve with scale, the authors empirically discover, and formally characterize for the first time, how domain importance shifts with training scale. Based on this insight, they design an automatic reweighting and extrapolation algorithm that adapts data proportions to larger budgets without additional training. In GPT-2 Large pre-training, the method achieves 28% faster perplexity reduction than the baseline and up to 38% faster than uniform mixing, while achieving the best average downstream-task performance. The core contributions are: (i) a theoretical foundation for the scale-dependent evolution of data-domain importance, and (ii) a paradigm in which an optimal data mixture is learned at small scale and reliably extrapolated to large scale.

📝 Abstract
Domain reweighting is an emerging research area aimed at adjusting the relative weights of different data sources to improve the effectiveness and efficiency of LLM pre-training. We show that data mixtures that perform well at smaller scales may not retain their advantage at larger scales, challenging the existing practice of determining competitive mixtures in small-scale experiments and directly applying them at much larger scales. To address this, we propose AutoScale, a two-stage, scale-aware data composition framework. First, AutoScale fits a parametric model that predicts the model's loss under different data compositions, then uses it to find an approximate best allocation at smaller, more manageable budgets. Next, leveraging a novel theoretical analysis of how optimal compositions evolve with scale, AutoScale extrapolates that composition to larger budgets without further retraining. Empirically, AutoScale accelerates convergence and improves downstream performance. For instance, when pre-training GPT-2 Large, it achieves a 28% faster perplexity reduction than baselines and up to a 38% speed-up over unweighted training, while yielding best-average results on various downstream tasks. Overall, our findings illustrate how domain importance shifts with training scale, underscoring the need for scale-dependent data curation in LLM training. Our code is open-sourced.
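The abstract's first stage, fitting a parametric predictor of loss as a function of the data composition and searching it for a good mixture, can be sketched as follows. The per-domain power-law form, the domain names, and all coefficients below are illustrative assumptions for the sketch, not the paper's actual fitted model.

```python
import itertools

# Hypothetical fitted per-domain terms: loss(w) = sum_i c_i * (w_i * N) ** -a_i,
# where w_i is domain i's mixture weight and N is the total token budget.
# (c_i, a_i) values here are made up for illustration.
COEFFS = {"web": (2.0, 0.30), "code": (1.5, 0.25), "books": (1.0, 0.20)}

def predicted_loss(weights, n_tokens=1e9):
    """Predict validation loss for a domain mixture under the assumed form."""
    return sum(c * (max(w, 1e-12) * n_tokens) ** -a
               for (c, a), w in zip(COEFFS.values(), weights))

def best_mixture(step=0.05):
    """Grid-search the probability simplex for the lowest predicted loss."""
    k = int(round(1 / step))
    best = None
    for combo in itertools.product(range(k + 1), repeat=len(COEFFS) - 1):
        if sum(combo) > k:
            continue
        # Last domain's weight is whatever remains, so weights sum to 1.
        w = [c / k for c in combo] + [(k - sum(combo)) / k]
        loss = predicted_loss(w)
        if best is None or loss < best[0]:
            best = (loss, w)
    return best  # (predicted loss, mixture weights)
```

A coarse grid search keeps the sketch dependency-free; with a differentiable parametric form, a constrained optimizer over the simplex would be the more realistic choice at this stage.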
Problem

Research questions and friction points this paper is trying to address.

Optimizing data mixtures for LLM pre-training across scales
Addressing the performance drop when mixtures tuned at small scale are applied at larger scales
Developing a scale-aware framework for efficient data composition
Innovation

Methods, ideas, or system contributions that make the work stand out.

A parametric model predicts loss for candidate data compositions
Optimal compositions are extrapolated to larger scales without retraining
Accelerates convergence and improves downstream performance
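The extrapolation step above can be sketched roughly as follows: fit each domain's optimal weight as a trend in log-budget from two small-scale runs, project it to the target budget, and renormalize. The budgets, domain names, and weight values are hypothetical placeholders, and the simple log-linear trend is an illustrative stand-in for the paper's theoretical analysis.

```python
import math

# Optimal mixtures found at two small budgets (made-up numbers for illustration).
budgets = [1e8, 1e9]                       # training tokens
opt = {"web":   [0.50, 0.45],
       "code":  [0.20, 0.25],
       "books": [0.30, 0.30]}

def extrapolate(target_budget):
    """Linearly extrapolate each domain weight in log-budget, then renormalize."""
    x0, x1 = math.log(budgets[0]), math.log(budgets[1])
    xt = math.log(target_budget)
    raw = {}
    for domain, (w0, w1) in opt.items():
        slope = (w1 - w0) / (x1 - x0)
        raw[domain] = max(w0 + slope * (xt - x0), 0.0)  # clamp at zero
    total = sum(raw.values())
    return {domain: w / total for domain, w in raw.items()}
```

The key property this illustrates is the paper's central finding: domain importance is not fixed, so the mixture prescribed for a 10x larger budget differs from the one that was optimal at small scale.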