Aioli: A Unified Optimization Framework for Language Model Data Mixing

📅 2024-11-08
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
This work addresses the challenge of inconsistent performance when mixing multi-domain data (e.g., law, code, math) for large language model (LLM) pretraining, where manually optimizing domain mixture ratios is difficult and error-prone. The authors propose Aioli, a unified online optimization framework. Its core insight is that the inconsistent performance of existing mixing methods stems from inaccurately set mixing law parameters; Aioli instead estimates these parameters directly throughout training and uses them to dynamically adjust domain mixture proportions. Evaluated on six benchmark datasets, Aioli outperforms stratified sampling on all six, by an average of 0.27 test perplexity points. In the practical setting where proportions are learned on shorter runs due to computational constraints and then dynamically adjusted over the full training run, it improves on existing methods by up to 12.012 test perplexity points, substantially outperforming stratified sampling and state-of-the-art mixing strategies.

📝 Abstract
Language model performance depends on identifying the optimal mixture of data groups to train on (e.g., law, code, math). Prior work has proposed a diverse set of methods to efficiently learn mixture proportions, ranging from fitting regression models over training runs to dynamically updating proportions throughout training. Surprisingly, we find that no existing method consistently outperforms a simple stratified sampling baseline in terms of average test perplexity. To understand this inconsistency, we unify existing methods into a standard framework, showing they are equivalent to solving a common optimization problem: minimize average loss subject to a method-specific mixing law -- an implicit assumption on the relationship between loss and mixture proportions. This framework suggests that measuring the fidelity of a method's mixing law can offer insights into its performance. Empirically, we find that existing methods set their mixing law parameters inaccurately, resulting in the inconsistent mixing performance we observe. Using this insight, we derive a new online method named Aioli, which directly estimates the mixing law parameters throughout training and uses them to dynamically adjust proportions. Aioli outperforms stratified sampling on 6 out of 6 datasets by an average of 0.27 test perplexity points, whereas existing methods fail to consistently beat stratified sampling, doing up to 6.9 points worse. Moreover, in a practical setting where proportions are learned on shorter runs due to computational constraints, Aioli can dynamically adjust these proportions over the full training run, consistently improving performance over existing methods by up to 12.012 test perplexity points.
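The unified framework described in the abstract can be written schematically as a constrained optimization problem. The symbols below are illustrative, not the paper's exact notation: $p^t \in \Delta^{m-1}$ denotes the mixture proportions over $m$ data groups at training step $t$, $L_i$ the test loss on group $i$, and $f_\theta$ the method-specific mixing law with parameters $\theta$:

```latex
\min_{p^1, \dots, p^T \in \Delta^{m-1}}
  \; \frac{1}{m} \sum_{i=1}^{m} L_i\!\left(p^{1:T}\right)
\quad \text{subject to} \quad
  L_i \text{ follows the mixing law } f_\theta .
```

Under this view, each prior method corresponds to a particular choice of $f_\theta$ and a particular way of setting $\theta$; the paper's finding is that performance hinges on how accurately $\theta$ is set.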
Problem

Research questions and friction points this paper is trying to address.

How to optimize the mixture of data groups (e.g., law, code, math) for language model training
Why no existing mixing method consistently beats a simple stratified sampling baseline
How to measure the fidelity of the mixing laws implicitly assumed by existing methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified optimization framework for data mixing
Dynamic adjustment of mixture proportions online
Direct estimation of mixing law parameters
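The online loop these contributions describe can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: `estimate_sensitivities` is a hypothetical finite-difference probe standing in for Aioli's mixing-law parameter estimation, and the proportion update uses a generic exponentiated-gradient step on the simplex.

```python
import math

def estimate_sensitivities(probe_loss, p, eps=0.05):
    """Hypothetical stand-in for online mixing-law estimation:
    perturb each group's proportion and record the change in
    average loss, approximating d(loss)/d(p_j)."""
    base = probe_loss(p)
    grads = []
    for j in range(len(p)):
        q = list(p)
        q[j] += eps
        s = sum(q)
        q = [x / s for x in q]  # renormalize onto the simplex
        grads.append((probe_loss(q) - base) / eps)
    return grads

def update_proportions(p, grads, lr=1.0):
    """Exponentiated-gradient step: down-weight groups whose
    increased proportion raises the average loss."""
    w = [pi * math.exp(-lr * g) for pi, g in zip(p, grads)]
    z = sum(w)
    return [x / z for x in w]
```

In an actual training run, `probe_loss` would be replaced by held-out loss measurements under slightly perturbed mixtures, and the two steps would alternate with gradient updates to the model itself.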
Mayee F. Chen
Department of Computer Science, Stanford University
Michael Y. Hu
Center for Data Science, New York University
Nicholas Lourie
Kyunghyun Cho
New York University, Genentech
Machine Learning · Deep Learning
Christopher Ré
Computer Science, Stanford University
machine learning · artificial intelligence · machine learning systems · data management · AI systems